Preface



Theoretical computer science treats any computational subject for which a good model can be 
created. Research on formal models of computation was initiated in the 1930s and 1940s by 
Turing, Post, Kleene, Church, and others. In the 1950s and 1960s programming languages, 
language translators, and operating systems were under development and therefore became 
both the subject and basis for a great deal of theoretical work. The power of computers of 
this period was limited by slow processors and small amounts of memory, and thus theories 
(models, algorithms, and analysis) were developed to explore the efficient use of computers as 
well as the inherent complexity of problems. The former subject is known today as algorithms 
and data structures, the latter computational complexity. 

The focus of theoretical computer scientists in the 1960s on languages is reflected in the 
first textbook on the subject, Formal Languages and Their Relation to Automata by John 
Hopcroft and Jeffrey Ullman. This influential book led to the creation of many language- 
centered theoretical computer science courses; many introductory theory courses today con- 
tinue to reflect the content of this book and the interests of theoreticians of the 1960s and early 
1970s. 

Although the 1970s and 1980s saw the development of models and methods of analysis 
directed at understanding the limits on the performance of computers, this attractive new 
material has not been made available at the introductory level. This book is designed to remedy 
this situation. 

This book is distinguished from others on theoretical computer science by its primary focus 
on real problems, its emphasis on concrete models of machines and programming styles, and 
the number and variety of models and styles it covers. These include the logic circuit, the finite- 
state machine, the pushdown automaton, the random-access machine, memory hierarchies, 
the PRAM (parallel random-access machine), the VLSI (very large-scale integrated) chip, and 
a variety of parallel machines. 

The book covers the traditional topics of formal languages and automata and complexity
classes but also gives an introduction to the more modern topics of space-time tradeoffs, mem- 
ory hierarchies, parallel computation, the VLSI model, and circuit complexity. These modern 
topics are integrated throughout the text, as illustrated by the early introduction of P-complete 
and NP-complete problems. The book provides the first textbook treatment of space-time 
tradeoffs and memory hierarchies as well as a comprehensive introduction to traditional
computational complexity. Its treatment of circuit complexity is modern and substantive, and
parallelism is integrated throughout. 



Plan of the Book 

The book has three parts. Part I (Chapter 1) is an overview. Part II, consisting of Chapters 2-7,
provides an introduction to general computational models. Chapter 2 introduces logic circuits 
and derives upper bounds on the size and depth of circuits for important problems. The finite- 
state, random-access, and Turing machine models are defined in Chapter 3 and circuits are 
presented that simulate computations performed by these machines. From such simulations 
arise results of two kinds. First, computational inequalities of the form $C(f) \le kST$ are
derived for problems f run on the random-access machine, where C(f) is the size of the
smallest circuit for f, k is a constant, and S and T are storage space and computation time.
If ST is too small relative to C(f), the problem f cannot be solved. Second, the same circuit
simulations are interpreted to identify P-complete and NP-complete problems. P-complete 
problems can all be solved in polynomial time but are believed hard to solve fast on parallel 
machines. The NP-complete problems include many important scheduling and optimization 
problems and are believed not solvable in polynomial time on serial machines. 

Part II also contains traditional material on formal languages and automata. Chapter 4 
explores the connection between two machine models (the finite-state machine and the push- 
down automaton) and language types in the Chomsky hierarchy. Chapter 5 examines Turing 
machines. It shows that the languages recognized by them are the phrase-structure languages, 
the most expressive of the language types in the Chomsky hierarchy. This chapter also exam- 
ines universal Turing machines, reducibility, unsolvable problems, and the functions computed 
by Turing machines. 

Finally, Part II contains Chapters 6 and 7, which introduce algebraic and combinatorial
circuits and parallel machine models, respectively. Algebraic and combinatorial circuits are 
graphs of straight-line programs of the kind typically used for matrix multiplication and in- 
version, solving linear systems of equations, computing the fast Fourier transform, performing 
convolutions, and merging and sorting. Chapter 6 contains reference material on problems 
used in later chapters to illustrate models and lower-bound arguments. Parallel machine mod- 
els such as the PRAM and networks of computers organized as meshes and hypercubes are 
studied in Chapter 7. A framework is given for the design of algorithms and derivation of 
lower bounds on performance. 

Part III, a comprehensive treatment of computational complexity, consists of Chapters 8-12.
Chapter 8 provides a comprehensive survey of traditional computational complexity. Using
serial and parallel machine models, it examines time- and space-bounded complexity classes, 
including the P-complete, NP-complete and PSPACE-complete languages as well as the circuit 
complexity classes NC and P/poly. This chapter also establishes the connections between
deterministic and nondeterministic space complexity classes and shows that the nondeterministic
space classes are closed under complements. 

Circuit complexity is the topic of Chapter 9. Methods for deriving lower bounds on circuit 
size and depth are given for general circuits, formulas, monotone circuits, and bounded-depth 
circuits. This modern treatment of circuit complexity complements Chapter 2, which derives 
tight upper bounds on circuit size and depth. 

Space-time tradeoffs are studied in Chapter 10 using two computational models, the 
branching program and the pebble game, which capture the notions of space and time for 
many programs for which branching is and is not allowed, respectively. Methods for deriving 
lower bounds on the exchange of space for time are presented and applied to a representative 
set of problems. 

Chapter 11 examines models for memory hierarchy systems. It uses the pebble game with
pebbles of multiple colors to designate storage locations at different levels of a hierarchy, and 
also employs block and RAM-based models. Again, lower bounds on performance are derived 
and compared with the performance of algorithms. This chapter also has a brief treatment of 
the LRU and FIFO memory-management algorithms that uses competitive analysis to com- 
pare their performance to that of the optimal algorithm. 

The book closes with Chapter 12 on the VLSI model for integrated circuits. In this model 
both chip area A and time T are important, and methods are given for deriving lower bounds 
on measures such as $AT^2$. Chip layouts and VLSI algorithms are also exhibited whose
performance comes close to matching the lower bounds.



Use of the Book 

Many different courses can be designed around this book. A core undergraduate computer 
science course can be taught using Parts I and II and some material from Chapter 8. The 
first course on theoretical computer science for majors at Brown uses most of Chapters 1-5 
except for the advanced material in Chapters 2 and 3. It uses a few elementary sections from 
Chapters 10 and 11 to emphasize space-time tradeoffs, which play a central role in Chapter 3
and lead into the study of formal languages and automata in Chapter 4. After covering the 
material of Chapter 5, a few lectures are given on NP-complete problems from Chapter 8. 

This introductory course has four programming assignments in Scheme that illustrate the 
ideas embodied in Chapters 2, 3 and 5. The first program solves the circuit-value problem,
that is, it executes a straight-line program, thereby producing the outputs defined by this 
program. The second program writes a straight-line program simulating T steps by a finite- 
state machine. The third program writes a straight-line program simulating T steps by a 
one-tape Turing machine (this is the reduction involved in the Cook-Levin theorem) and the 
fourth one simulates a universal Turing machine. 

Several different advanced courses can be assembled from the material of Part III and 
introductory material of Part II. For example, a course on concrete computational complexity 
can be assembled around Chapters 10 and 11, which examine tradeoffs between space and
time in primary and secondary memory. This course would presume or include introductory 
material from Chapter 3. 

An advanced course emphasizing traditional computational complexity can be based pri- 
marily on computability (Chapter 5) and complexity classes (Chapter 8) and some material on 
circuit complexity from Chapter 9. 

An advanced course on circuit complexity can be assembled from Chapter 2 on logic
circuits and Chapter 9 on circuit complexity. The former describes efficient circuits for a variety
of functions while the latter surveys methods for deriving lower bounds to circuit complexity. 

The titles of sections containing advanced material carry an asterisk. 

Models of Computation

Exploring the Power of Computing 



Part I
OVERVIEW OF THE BOOK



CHAPTER 1

The Role of Theory in Computer Science



Computer science is the study of computers and programs, the collections of instructions that 
direct the activity of computers. Although computers are made of simple elements, the tasks 
they perform are often very complex. The great disparity between the simplicity of computers 
and the complexity of computational tasks offers intellectual challenges of the highest order. It 
is the models and methods of analysis developed by computer science to meet these challenges 
that are the subject of theoretical computer science. 

Computer scientists have developed models for machines, such as the random-access and 
Turing machines; for languages, such as regular and context-free languages; for programs, such 
as straight-line and branching programs; and for systems of programs, such as compilers and 
operating systems. Models have also been developed for data structures, such as heaps, and for 
databases, such as the relational and object-oriented databases. 

Methods of analysis have been developed to study the efficiency of algorithms and their 
data structures, the expressibility of languages and the capacity of computer architectures to 
recognize them, the classification of problems by the time and space required to solve them, 
their inherent complexity, and limits that hold simultaneously on computational resources for 
particular problems. This book examines each of these topics in detail except for the first, 
analysis of algorithms and data structures, which it covers only briefly. 

This chapter provides an overview of the book. Except for the mathematical preliminaries, 
the topics introduced here are revisited later. 

1.1 A Brief History of Theoretical Computer Science

Theoretical computer science uses models and analysis to study computers and computation. 
It thus encompasses the many areas of computer science sufficiently well developed to have 
models and methods of analysis. This includes most areas of the field. 

1.1.1 Early Years 

TURING AND CHURCH: Theoretical computer science emerged primarily from the work of 
Alan Turing and Alonzo Church in 1936, although many others, such as Russell, Hilbert, and 
Boole, were important precursors. Turing and Church introduced formal computational mod- 
els (the Turing machine and lambda calculus), showed that some well-stated computational 
problems have no solution, and demonstrated the existence of universal computing machines, 
machines capable of simulating every other machine of their type. 

Turing and Church were logicians; their work reflected the concerns of mathematical logic. 
The origins of computers predate them by centuries, going back at least as far as the abacus, if 
we call any mechanical aid to computation a computer. A very important contribution to the 
study of computers was made by Charles Babbage, who in 1836 completed the design of his 
first programmable Analytical Engine, a mechanical computer capable of arithmetic operations 
under the control of a sequence of punched cards (an idea borrowed from the Jacquard loom). 
A notable development in the history of computers, but one of less significance, was the 1938 
demonstration by Claude Shannon that Boolean algebra could be used to explain the operation 
of relay circuits, a form of electromechanical computer. He was later to develop his profound 
"mathematical theory of communication" in 1948 as well as to lay the foundations for the 
study of circuit complexity in 1949. 

FIRST COMPUTERS: In 1941 Konrad Zuse built the Z3, the first general-purpose program-controlled
computer, a machine constructed from electromagnetic relays. The Z3 read programs
from a punched paper tape. In the mid-1940s the first programmable electronic computer
(using vacuum tubes), the ENIAC, was developed by Eckert and Mauchly. Von Neumann,
in a very influential paper, codified the model that now carries his name. With the
invention of the transistor in 1947, electronic computers were to become much smaller and 
more powerful than the 30-ton ENIAC. The microminiaturization of transistors continues 
today to produce computers of ever-increasing computing power in ever-shrinking packages. 

EARLY LANGUAGE DEVELOPMENT: The first computers were very difficult to program (cables 
were plugged and unplugged on the ENIAC). Later, programmers supplied commands by 
typing in sequences of 0's and 1's, the machine language of computers. A major contribution
of the 1950s was the development of programming languages, such as FORTRAN, COBOL, 
and LISP. These languages allowed programmers to specify commands in mnemonic code and 
with high-level constructs such as loops, arrays, and procedures.

As languages were developed, it became important to understand their expressiveness as 
well as the characteristics of the simplest computers that could translate them into machine 
language. As a consequence, formal languages and the automata that recognize them became 
an important topic of study in the 1950s. Nondeterministic models - models that may have 
more than one possible next state for the current state and input - were introduced during this 
time as a way to classify languages. 

1.1.2 1950s

FINITE-STATE MACHINES: Occurring in parallel with the development of languages was the 
development of models for computers. The 1950s also saw the formalization of the finite-state 
machine (also called the sequential machine), the sequential circuit (the concrete realization of 
a sequential machine), and the pushdown automaton. Rabin and Scott pioneered the use of 
analytical tools to study the capabilities and limitations of these models. 

FORMAL LANGUAGES: The late 1950s and 1960s saw an explosion of research on formal lan- 
guages. By 1964 the Chomsky language hierarchy, consisting of the regular, context-free, 
context-sensitive, and recursively enumerable languages, was established, as was the correspon- 
dence between these languages and the memory organizations of machine types recognizing 
them, namely the finite-state machine, the pushdown automaton, the linear-bounded au- 
tomaton, and the Turing machine. Many variants of these standard grammars, languages, 
and machines were also examined. 

1.1.3 1960s 

COMPUTATIONAL COMPLEXITY: The 1960s also saw the laying of the foundation for computational
complexity with the classification by Hartmanis, Lewis, Stearns, and others of languages
and functions according to the time and space needed to compute them. Hierarchies of problems
were identified and speed-up and gap theorems established. This area was to flower and
lead to many important discoveries, including that by Cook (and independently Levin) of 
NP-complete languages, languages associated with many hard combinatorial and optimiza- 
tion problems, including the Traveling Salesperson problem, the problem of determining the 
shortest tour of cities for which all intercity distances are given. Karp was instrumental in 
demonstrating the importance of NP-complete languages. Because problems whose running 
time is exponential are considered intractable, it is very important to know whether strings in
NP-complete languages can be recognized in time polynomial in their length. This is called
the $P \stackrel{?}{=} NP$ problem, where P is the class of deterministic polynomial-time languages. The
P-complete languages were also identified in the 1970s; these are the hardest languages in P to 
recognize on parallel machines. 

1.1.4 1970s 

COMPUTATION TIME AND CIRCUIT COMPLEXITY: In the early 1970s the connection between
computation time on Turing machines and circuit complexity was established, thereby giving
the study of circuits renewed importance and offering the hope that the $P \stackrel{?}{=} NP$ problem
could be resolved via circuit complexity.

PROGRAMMING LANGUAGE SEMANTICS: The 1970s were a very productive period for formal 
methods in the study of programs and languages. The area of programming language seman- 
tics was very active; models and denotations were developed to give meaning to the phrase 
"programming language," thereby putting language development on a solid footing. Formal 
methods for ensuring the correctness of programs were also developed and applied to program 
development. The 1970s also saw the emergence of the relational database model and the 
development of the relational calculus as a means for the efficient reformulation of database
queries. 

SPACE-TIME TRADEOFFS: An important byproduct of the work on formal languages and se- 
mantics in the 1970s is the pebble game. In this game, played on a directed acyclic graph, 
pebbles are placed on vertices to indicate that the value associated with a vertex is located in 
the register of a central processing unit. The game allows the study of tradeoffs between the 
number of pebbles (or registers) and time (the number of pebble placements) and leads to 
space-time product inequalities for individual problems. These ideas were generalized in the 
1980s to branching programs. 

VLSI MODEL: When the very large-scale integration (VLSI) of electronic components onto 
semiconductor chips emerged in the 1970s, VLSI models for them were introduced and an- 
alyzed. Ideas from the study of pebble games were applied and led to tradeoff inequalities 
relating the complexity of a problem to products such as $AT^2$, where A is the area of a chip
and T is the number of steps it takes to solve a problem. In the late 1970s and 1980s the 
layout of computers on VLSI chips also became an important research topic. 

ALGORITHMS AND DATA STRUCTURES: While algorithms (models for programs) and data struc- 
tures were introduced from the beginning of the field, they experienced a flowering in the 
1970s and 1980s. Knuth was most influential in this development, as later were Aho, Hopcroft,
and Ullman. New algorithms were invented for sorting, data storage and retrieval, problems on 
graphs, polynomial evaluation, solving linear systems of equations, computational geometry, 
and many other topics on both serial and parallel machines. 

1.1.5 1980s and 1990s 

PARALLEL COMPUTING AND I/O COMPLEXITY: The 1980s also saw the emergence of many 
other theoretical computer science research topics, including parallel and distributed comput- 
ing, cryptography, and I/O complexity. A variety of concrete and abstract models of parallel 
computers were developed, ranging from VLSI-based models to the parallel random-access 
machine (PRAM), a collection of synchronous processors alternately reading from and writ- 
ing to a common array of memory cells and computing locally. Parallel algorithms and data 
structures were developed, as were classifications of problems according to the extent to which 
they are parallelizable. I/O complexity, the study of data movement among memory units 
in a memory hierarchy, emerged around 1980. Memory hierarchies take advantage of the 
temporal and spatial locality of problems to simulate fast, expensive memories with slow and 
inexpensive ones. 

DISTRIBUTED COMPUTING: The emergence of networks of computers brought to light some 
hard logical problems that led to a theory of distributed computing, that is, computing with 
multiple and potentially asynchronous processors that may be widely dispersed. The prob- 
lems addressed in this area include reaching consensus in the presence of malicious adversaries, 
handling processor failures, and efficiently coordinating the activities of agents when interpro- 
cessor latencies are large. Although some of the problems addressed in distributed computing 
were first introduced in the 1950s, this topic is associated with the 1980s and 1990s. 

CRYPTOGRAPHY: While cryptography has been important for ages, it became a serious concern
of complexity theorists in the late 1970s and an active research area in the 1980s and
1990s. Some of the important cryptographic issues are a) how to exchange information secretly
without having to exchange a private key with each communicating agent, b) how to
identify with high assurance the sender of a message, and c) how to convince another agent 
that you have the solution to a problem without transferring the solution to him or her. 

As this brief history illustrates, theoretical computer science speaks to many different com- 
putational issues. As the range of issues addressed by computer science grows in sophistication, 
we can expect a commensurate growth in the richness of theoretical computer science. 

1.2 Mathematical Preliminaries 

In this section we introduce basic concepts used throughout the book. Since it is presumed 
that the reader has already met most of this material, this presentation is abbreviated. 

1.2.1 Sets 

A set A is a non-repeating and unordered collection of elements. For example, $A_{50s}$ =
{Cobol, Fortran, Lisp} is a set of elements that could be interpreted as the names of languages
designed in the 1950s. Because the elements in a set are unordered, {Cobol, Fortran, Lisp}
and {Lisp, Cobol, Fortran} denote the same set. It is very convenient to recognize the empty
set $\emptyset$, a set that does not have any elements. The set B = {0, 1} containing 0 and 1 is used
throughout this book.

The notation $a \in A$ means that element a is contained in set A. For example, Cobol $\in A_{50s}$
means that Cobol is a language invented in the 1950s. A set can be finite or infinite. The
cardinality of a finite set A, denoted $|A|$, is the number of elements in A. We say that a set A
is a subset of a set B, denoted $A \subseteq B$, if every element of A is an element of B. If $A \subseteq B$
but B contains elements not in A, we say that A is a proper subset and write $A \subset B$.

The union of two sets A and B, denoted $A \cup B$, is the set containing elements that
are in A, B or both. For example, if $A_0 = \{1, 2, 3\}$ and $B_0 = \{4, 3, 5\}$, then
$A_0 \cup B_0 = \{5, 4, 3, 1, 2\}$. The intersection of sets A and B, denoted $A \cap B$, is the set
containing elements that are in both A and B. Hence, $A_0 \cap B_0 = \{3\}$. If A and B have no
elements in common, denoted $A \cap B = \emptyset$, they are said to be disjoint sets. The difference
between sets A and B, denoted $A - B$, is the set containing the elements that are in A but
not in B. Thus, $A_0 - B_0 = \{1, 2\}$. (See Fig. 1.1.)
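
The set operations just defined correspond closely to Python's built-in `set` type. The following is a minimal sketch using the example sets from the text; the variable names `A0`, `B0`, `union`, `intersection`, and `difference` are chosen here for illustration only.

```python
# Example sets from the text (the names A0 and B0 are illustrative).
A0 = {1, 2, 3}
B0 = {4, 3, 5}

union = A0 | B0          # elements in A0, in B0, or in both
intersection = A0 & B0   # elements in both A0 and B0
difference = A0 - B0     # elements in A0 but not in B0

# Sets are unordered, so {5, 4, 3, 1, 2} and {1, 2, 3, 4, 5} are the same set.
print(union == {5, 4, 3, 1, 2})
print(intersection)
print(difference)

# Two sets are disjoint when their intersection is the empty set.
print(A0.isdisjoint({7, 8}))
```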



Figure 1.1 A Venn diagram showing the intersection and difference of sets A and B. Their
union is the set of elements in A, in B, or in both.

The following simple properties hold for arbitrary sets A and B and the operations of set
union, intersection, and difference:

$A \cup B = B \cup A$

$A \cap B = B \cap A$

$A \cup \emptyset = A$

$A \cap \emptyset = \emptyset$

$A - \emptyset = A$

The power set of a set A, denoted $2^A$, is the set of all subsets of A including the empty
set. For example, $2^{\{2,5,9\}} = \{\emptyset, \{2\}, \{5\}, \{9\}, \{2,5\}, \{2,9\}, \{5,9\}, \{2,5,9\}\}$. We use $2^A$
to denote the power set of A as a reminder that it has $2^{|A|}$ elements. To see this, observe that
for each subset B of the set A there is a binary n-tuple $(e_1, e_2, \ldots, e_{|A|})$ where $e_i$ is 1 if the
ith element of A is in B and 0 otherwise. Since there are $2^{|A|}$ ways to assign 0's and 1's to
$(e_1, e_2, \ldots, e_{|A|})$, $2^A$ has $2^{|A|}$ elements.
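
The counting argument above can be turned into a short enumeration: each binary tuple selects exactly one subset. The sketch below is one possible way to do this in Python; the function name `power_set` is hypothetical, and `frozenset` is used only so that subsets can themselves be stored in a set.

```python
from itertools import product

def power_set(elements):
    """Return all subsets of `elements`, one per binary tuple (e_1, ..., e_n)."""
    items = list(elements)
    subsets = []
    # Each binary n-tuple selects the elements whose bit e_i is 1.
    for bits in product((0, 1), repeat=len(items)):
        subsets.append(frozenset(x for x, e in zip(items, bits) if e == 1))
    return subsets

subsets = power_set([2, 5, 9])
print(len(subsets))  # number of subsets of a 3-element set
```

Because every tuple yields a different subset, the length of the returned list equals 2 raised to the number of elements.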

The Cartesian product of two sets A and B, denoted $A \times B$, is another set, the set of pairs
$\{(a, b) \mid a \in A, b \in B\}$. For example, when $A = \{1, 2, 3\}$ and $B = \{4, 3, 5\}$,
$A \times B = \{(1,4), (1,3), (1,5), (2,4), (2,3), (2,5), (3,4), (3,3), (3,5)\}$. The Cartesian product of k
sets $A_1, A_2, \ldots, A_k$, denoted $A_1 \times A_2 \times \cdots \times A_k$, is the set of k-tuples
$\{(a_1, a_2, \ldots, a_k) \mid a_1 \in A_1, a_2 \in A_2, \ldots, a_k \in A_k\}$ whose components are drawn from the
respective sets. If for each $1 \le i \le k$, $A_i = A$, the k-fold Cartesian product
$A_1 \times A_2 \times \cdots \times A_k$ is denoted $A^k$. An element of $A^k$ is a k-tuple $(a_1, a_2, \ldots, a_k)$
where $a_i \in A$. Thus, the binary n-tuple $(e_1, e_2, \ldots, e_{|A|})$ of the preceding paragraph is an
element of $\{0, 1\}^n$.
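
As an illustration of the definition, Python's standard `itertools.product` enumerates Cartesian products directly. This is only a sketch; the variable names `pairs` and `binary_triples` are assumptions made for the example.

```python
from itertools import product

A = [1, 2, 3]
B = [4, 3, 5]

# A x B: every pair (a, b) with a drawn from A and b drawn from B.
pairs = list(product(A, B))

# The 3-fold product {0,1}^3: all binary 3-tuples.
binary_triples = list(product([0, 1], repeat=3))

print(pairs)
print(len(binary_triples))
```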

1.2.2 Number Systems 

Integers are widely used to describe problems. The infinite set $\mathbb{N}$ consisting of 0 and the
positive integers $\{1, 2, 3, \ldots\}$ is called the set of natural numbers. The set of positive and
negative integers and zero, $\mathbb{Z}$, consists of the integers $\{0, 1, -1, 2, -2, \ldots\}$.

In the standard decimal representation of the natural numbers, each integer n is represented
as a sum of powers of 10. For example, $867 = 8 \times 10^2 + 6 \times 10^1 + 7 \times 10^0$. Since
computers today are binary machines, it is convenient to represent integers over base 2 instead
of 10. The standard binary representation for the natural numbers represents each integer as
a sum of powers of 2. That is, for some $k \ge 0$ each integer n can be represented as a k-tuple
$x = (x_{k-1}, x_{k-2}, \ldots, x_1, x_0)$, where each of $x_{k-1}, x_{k-2}, \ldots, x_1, x_0$ has value 0 or 1 and n
satisfies the following identity:

$n = x_{k-1} 2^{k-1} + x_{k-2} 2^{k-2} + \cdots + x_1 2^1 + x_0 2^0$

The largest integer that can be represented with k bits is
$2^{k-1} + 2^{k-2} + \cdots + 2^1 + 2^0 = 2^k - 1$. (See Problem 1.1.) Also, the k-tuple representation
for n is unique; that is, two different integers cannot have the same representation. When
leading 0's are suppressed, the standard binary representations for 1, 15, 32, and 97 are
(1), (1, 1, 1, 1), (1, 0, 0, 0, 0, 0), and (1, 1, 0, 0, 0, 0, 1), respectively.
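
The identity above can be checked mechanically. The following sketch converts a natural number to its binary tuple by repeated division by 2 and evaluates a tuple back to an integer; the function names `to_binary` and `from_binary` are hypothetical, and representing 0 by the single bit (0) is a convention assumed here rather than one stated in the text.

```python
def to_binary(n):
    """Return the binary representation of a natural number n as a tuple
    (x_{k-1}, ..., x_1, x_0) with leading 0's suppressed."""
    if n < 0:
        raise ValueError("n must be a natural number")
    if n == 0:
        return (0,)  # assumed convention: represent 0 with a single bit
    bits = []
    while n > 0:
        bits.append(n % 2)  # low-order bit first
        n //= 2
    return tuple(reversed(bits))

def from_binary(x):
    """Evaluate the sum of x_j * 2^j for a tuple (x_{k-1}, ..., x_0)."""
    return sum(bit * 2**j for j, bit in enumerate(reversed(x)))

for n in (1, 15, 32, 97):
    print(n, to_binary(n))
```

Applying `from_binary` to a tuple of k ones gives the largest integer representable with k bits.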

We denote with x + y, x — y, x * y, and x/y the results of addition, subtraction, multi- 
plication, and division of integers. 

1.2.3 Languages and Strings

An alphabet A is a finite set with at least two elements. A string x is an element $(a_1, a_2, \ldots, a_k)$
of the Cartesian product $A^k$ in which we drop the commas and parentheses. Thus, we write
$x = a_1 a_2 \cdots a_k$, and say that x is a string over the alphabet A. A string x in $A^k$ is said to
have length k, denoted $|x| = k$. Thus, 011 is a string of length three over A = {0, 1}.

Consider now the Cartesian product $A^k \times A^l = A^{k+l}$, which is the (k + l)-fold Cartesian
product of A with itself. Let $x = a_1 a_2 \cdots a_k \in A^k$ and $y = b_1 b_2 \cdots b_l \in A^l$. Then a string
$z = c_1 c_2 \cdots c_{k+l} \in A^{k+l}$ can be written as the concatenation of strings x and y of length k
and l, denoted $z = x \cdot y$, where

$x \cdot y = a_1 a_2 \cdots a_k b_1 b_2 \cdots b_l$

That is, $c_i = a_i$ for $1 \le i \le k$ and $c_i = b_{i-k}$ for $k + 1 \le i \le k + l$.

The empty string, denoted $\epsilon$, is a special string with the property that when concatenated
with any other string x it returns x; that is, $x \cdot \epsilon = \epsilon \cdot x = x$. The empty string is said to have
zero length. As a special case of $A^k$, we let $A^0$ denote the set containing the empty string;
that is, $A^0 = \{\epsilon\}$.

The concatenation of sets of strings A and B, denoted $A \cdot B$, is the set of strings formed
by concatenating each string in A with each string in B. For example,
$\{00, 1\} \cdot \{a, bb\} = \{00a, 00bb, 1a, 1bb\}$. The concatenation of a set A with the empty set
$\emptyset$, denoted $A \cdot \emptyset$, is the empty set because it contains no elements; that is,

$A \cdot \emptyset = \emptyset$

When no confusion arises, we write AB instead of $A \cdot B$.
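
Concatenation of sets of strings can be illustrated with a one-line set comprehension. This is a sketch only: the function name `concat` is hypothetical, Python strings stand in for strings over an alphabet, and the Python string `""` plays the role of the empty string.

```python
def concat(A, B):
    """Concatenate every string in A with every string in B."""
    return {x + y for x in A for y in B}

print(concat({"00", "1"}, {"a", "bb"}))

# Concatenating with the empty set yields the empty set,
# because there is no string in it to pair with.
print(concat({"00", "1"}, set()))

# Concatenating with the set containing only the empty string changes nothing.
print(concat({"00", "1"}, {""}))
```

The last two calls show the difference between the empty set, which contains no strings, and the set containing only the empty string.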

A language L over an alphabet A is a collection of strings of potentially different lengths
over A. For example, {00, 010, 1110, 1001} is a finite language over the alphabet {0, 1}. (It
is finite because it contains a bounded number of strings.) The set of all strings of all lengths
over the alphabet A, including the empty string, is denoted $A^*$ and called the Kleene closure
of A. For example, $\{0\}^*$ contains $\epsilon$, the empty string, as well as 0, 00, 000, 0000, .... Also,
$\{00 \cup 1\}^* = \{\epsilon, 1, 00, 001, 100, 0000, \ldots\}$. It follows that a language L over the alphabet A
is a subset of $A^*$, denoted $L \subseteq A^*$.
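
Since the Kleene closure is infinite, a program can only list a finite portion of it. The sketch below enumerates the strings of the closure up to a given length; the function name `kleene_up_to` and the length bound `max_len` are assumptions introduced for this illustration.

```python
def kleene_up_to(A, max_len):
    """Return the strings of A* of length at most max_len,
    where A is a set of strings."""
    result = {""}     # the empty string is always in A*
    frontier = {""}   # strings discovered in the previous round
    while frontier:
        # Extend each newly found string by one more element of A,
        # keeping only strings that are short enough and not yet seen.
        frontier = {x + a for x in frontier for a in A
                    if len(x + a) <= max_len} - result
        result |= frontier
    return result

print(sorted(kleene_up_to({"0"}, 4), key=len))
print(sorted(kleene_up_to({"00", "1"}, 3), key=len))
```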

The positive closure of a set A, denoted $A^+$, is the set of all strings over A except for
the empty string. For example, $0(0^*10^*)^+$ is the set of binary strings beginning with 0 and
containing at least one 1.

1.2.4 Relations 

A subset R of the Cartesian product of sets is called a relation. A binary relation R is a
subset of the Cartesian product of two sets. Three examples of binary relations are
$R_0 = \{(0,0), (1,1), (2,4), (3,9), (4,16)\}$, $R_1 = \{(red, 0), (green, 1), (blue, 2)\}$, and
$R_2 = \{(small, short), (medium, middle), (medium, average), (large, tall)\}$. The relation $R_0$ is a
function because for each first component of a pair there is a unique second component. $R_1$
is also a function, but $R_2$ is not a function.
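
The test for whether a binary relation is a function can be expressed in a few lines. The following is a minimal sketch, with the relations of the text stored as Python sets of pairs; the function name `is_function` is hypothetical.

```python
def is_function(relation):
    """True if each first component is paired with a unique second component."""
    image = {}
    for a, b in relation:
        if a in image and image[a] != b:
            return False  # a is paired with two different second components
        image[a] = b
    return True

R0 = {(0, 0), (1, 1), (2, 4), (3, 9), (4, 16)}
R1 = {("red", 0), ("green", 1), ("blue", 2)}
R2 = {("small", "short"), ("medium", "middle"),
      ("medium", "average"), ("large", "tall")}

print(is_function(R0), is_function(R1), is_function(R2))
```

Here `R2` fails the test because "medium" is paired with both "middle" and "average".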

A binary relation R over a set A is a subset of $A \times A$; that is, both components of each
pair are drawn from the same set. We use two notations to denote membership of a pair (a, b)
in a binary relation R over A, namely $(a, b) \in R$ and the new notation aRb. Often it is more
convenient to say aRb than to say $(a, b) \in R$.

A binary relation R is reflexive if for all $a \in A$, aRa. It is symmetric if for all $a, b \in A$,
aRb if and only if bRa. It is transitive if for all $a, b, c \in A$, if aRb and bRc, then aRc.

A binary relation R is an equivalence relation if it is reflexive, symmetric, and transitive.
For example, the set of pairs (a, b), $a, b \in \mathbb{N}$, for which a and b have the same remainder on
division by 3 is an equivalence relation. (See Problem 1.3.)

If R is an equivalence relation and aRb, then a and b are said to be equivalent elements. 
We let E[a] be the set of elements in A that are equivalent to a under the relation R and 
call it the equivalence class of elements equivalent to a. It is not difficult to show that for all 
$a, b \in A$, $E[a]$ and $E[b]$ are either equal or disjoint. (See Problem 1.4.) Thus, the equivalence
classes of an equivalence relation over a set A partition the elements of A into disjoint sets.
For example, the partition $\{0^*, 0(0^*10^*)^+, 1(0+1)^*\}$ of the set $(0+1)^*$ of binary strings
defines an equivalence relation R. The equivalence classes consist of strings of zero or
more 0's (and no 1's), strings starting with 0 and containing at least one 1, and strings
beginning with 1. It follows that $00\,R\,000$ and $1001\,R\,11$ hold but not $10\,R\,01$.
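
As a small illustration of how an equivalence relation partitions a set, the sketch below groups the integers 0 through 11 by their remainder on division by 3, the relation used as an example above. The helper names `related` and `equivalence_classes` and the choice of a 12-element set are assumptions made only for this illustration.

```python
def related(a, b):
    """aRb holds when a and b leave the same remainder on division by 3."""
    return a % 3 == b % 3


def equivalence_classes(elements, rel):
    """Group elements into classes; assumes rel is an equivalence relation."""
    classes = []
    for x in elements:
        for cls in classes:
            if rel(x, cls[0]):      # comparing with one representative suffices
                cls.append(x)
                break
        else:
            classes.append([x])     # x starts a new class
    return classes


A = list(range(12))
classes = equivalence_classes(A, related)
print(classes)  # three disjoint classes that together cover A
```

Comparing each element with a single representative of a class is enough only because the relation is symmetric and transitive; for an arbitrary relation the grouping would not be well defined.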

1.2.5 Graphs 

A directed graph $G = (V, E)$ consists of a finite set V of distinct vertices and a finite set
of pairs of distinct vertices $E \subseteq V \times V$ called edges. Edge e is incident on vertex v if e
contains v. A directed graph is undirected if for each edge $(v_1, v_2)$ in E the edge $(v_2, v_1)$
is also in E. Figure 1.2 shows two examples of directed graphs, some of whose vertices are
labeled with symbols denoting gates, a topic discussed in Section 1.2.7. In a directed graph
the edge $(v_1, v_2)$ is directed from the vertex $v_1$ to the vertex $v_2$, shown with an arrow from $v_1$
to $v_2$. The in-degree of a vertex in a directed graph is the number of edges directed into it; its
out-degree is the number of edges directed away from it; its degree is the sum of its in- and
out-degree. In a directed graph an input vertex has in-degree zero, whereas an output vertex
either has out-degree zero or is simply any vertex specially designated as an output vertex. A
walk in a graph (directed or undirected) is a tuple of vertices $(v_1, v_2, \ldots, v_p)$ with the property
that $(v_i, v_{i+1})$ is in E for $1 \le i \le p - 1$. A walk $(v_1, v_2, \ldots, v_p)$ is closed if $v_1 = v_p$. A path
is a walk with distinct vertices. A cycle is a closed walk with $p - 1$ distinct vertices, $p \ge 3$.
The length of a path is the number of edges on the path. Thus, the path $(v_1, v_2, \ldots, v_p)$ has
length $p - 1$. A directed acyclic graph (DAG) is a directed graph that has no cycles.



Figure 1.2 Two directed acyclic graphs, (a) and (b), representing logic circuits.

Logic circuits are DAGs in which all vertices except input vertices carry the labels of gates.
Input vertices carry the labels of Boolean variables, variables assuming values over the set
$\mathcal{B} = \{0, 1\}$. The graph of Fig. 1.2(a) is the logic circuit of Fig. 1.3(c), whereas the graph
of Fig. 1.2(b) is the logic circuit of Fig. 1.4. (The figures are shown in Section 1.4.1, Logic
Circuits.) The set of labels of logic gates used in a DAG is called the basis $\Omega$ for the DAG. The
size of a circuit is the number of non-input vertices that it contains. Its depth is the length of
the longest directed path from an input vertex to an output vertex.



1.2.6 Matrices 

An $m \times n$ matrix is an array of elements containing m rows and n columns. (See Chapter 6.)
The adjacency matrix of a graph G with n vertices is an $n \times n$ matrix whose entries are 0 or
1. The entry in the ith row and jth column is 1 if there is an edge from vertex i to vertex j
and 0 otherwise. The adjacency matrix A for the graph in Fig. 1.2(a) is

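$$A = \begin{bmatrix}
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 0
\end{bmatrix}$$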
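
The definition of the adjacency matrix can be exercised on a small example. The sketch below assumes a five-vertex DAG for the circuit $(x \vee y) \wedge z$ in which vertices 1, 2, and 3 are the inputs x, y, and z, vertex 4 is the OR gate, and vertex 5 is the AND gate. This numbering, the edge list, and the helper name `adjacency_matrix` are assumptions for illustration, since the figure itself is not reproduced here.

```python
# Assumed edge list for the circuit (x OR y) AND z:
# inputs 1, 2 feed the OR gate 4; input 3 and gate 4 feed the AND gate 5.
edges = [(1, 4), (2, 4), (3, 5), (4, 5)]
n = 5


def adjacency_matrix(n, edges):
    """Entry [i-1][j-1] is 1 if there is an edge from vertex i to vertex j, else 0."""
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        A[i - 1][j - 1] = 1
    return A


A = adjacency_matrix(n, edges)

# Column sums give in-degrees, row sums give out-degrees.
in_degree = [sum(A[i][j] for i in range(n)) for j in range(n)]
out_degree = [sum(row) for row in A]

# Circuit size: the number of non-input vertices (in-degree greater than zero).
size = sum(1 for d in in_degree if d > 0)

for row in A:
    print(row)
```

Under these assumptions the three input vertices have in-degree zero, the output vertex has out-degree zero, and the circuit has size 2.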

1.2.7 Functions 

The engineering component of computer science is concerned with the design, development, 
and testing of hardware and software. The theoretical component is concerned with questions 
of feasibility and optimality. For example, one might ask if there exists a program H that can 
determine whether an arbitrary program P on an arbitrary input I will halt or not. This is 
an example of an unsolvable computational problem. While it is a fascinating topic, practice 
often demands answers to less ethereal questions, such as "Can a particular problem be solved 
on a general-purpose computer with storage space S in T steps?" 

To address feasibility and optimality it is important to have a precise definition of the tasks
under examination. Functions serve this purpose. A function (or mapping) $f : \mathcal{D} \mapsto \mathcal{R}$ is
a relation f on the Cartesian product $\mathcal{D} \times \mathcal{R}$ subject to the requirement that for each $d \in \mathcal{D}$
there is at most one pair $(d, r)$ in f. If $(d, r) \in f$, we say that the value of f on d is r, denoted
$f(d) = r$. The domain and codomain of f are $\mathcal{D}$ and $\mathcal{R}$, respectively. The sets $\mathcal{D}$ and $\mathcal{R}$ can
be finite or infinite. For example, let $f_{\mathrm{mult}} : \mathbb{N}^2 \mapsto \mathbb{N}$ of domain $\mathcal{D} = \mathbb{N}^2$ and codomain
$\mathcal{R} = \mathbb{N}$ map a pair of natural numbers x and y ($\mathbb{N} = \{0, 1, 2, 3, \ldots\}$) into their product z;
that is, $f_{\mathrm{mult}}(x, y) = z = x * y$. A function $f : \mathcal{D} \mapsto \mathcal{R}$ is partial if for some $d \in \mathcal{D}$ no value
in $\mathcal{R}$ is assigned to $f(d)$. Otherwise, a function is complete.

If the domain of a function is the Cartesian product of n sets, the function is said to have 
n input variables. If the codomain of a function is the Cartesian product of m sets, the 
function is said to have m output variables. If the input variables of such a function are all 
drawn from the set A and the output variables are all drawn from the set B, this information 
is often captured by the notation $f^{(n,m)} : A^n \mapsto B^m$. However, we frequently do not use
exponents or we use only one exponent to parametrize a class of problems.

A finite function is one whose domain and codomain are both finite sets. Finite functions
can be completely defined by tables of pairs $\{(d, r)\}$, where d is an element of its domain and
r is the corresponding element of its codomain.

Binary functions are complete finite functions whose domains and codomains are Cartesian
products over the binary set $\mathcal{B} = \{0, 1\}$. Boolean functions are binary functions whose
codomain is $\mathcal{B}$. The tables below define three Boolean functions on two input variables and
one Boolean function on one input variable. They are called truth tables because the values 1
and 0 are often associated with the values True and False, respectively.

    x  y | x ∧ y      x  y | x ∨ y      x  y | x ⊕ y      x | NOT x
    0  0 |   0        0  0 |   0        0  0 |   0        0 |   1
    0  1 |   0        0  1 |   1        0  1 |   1        1 |   0
    1  0 |   0        1  0 |   1        1  0 |   1
    1  1 |   1        1  1 |   1        1  1 |   0

The above tables define the AND function $x \wedge y$ (its value is True when x and y are True),
the OR function $x \vee y$ (its value is True when either x or y or both are True), the EXCLUSIVE
OR function $x \oplus y$ (its value is True when exactly one of x and y is True, that is, when x is
True and y is False or vice versa), and the NOT function $\overline{x}$ (its value is True when x is
False and vice versa). The notation $f^{(2,1)}_{\wedge} : \mathcal{B}^2 \mapsto \mathcal{B}$, $f^{(2,1)}_{\vee} : \mathcal{B}^2 \mapsto \mathcal{B}$,
$f^{(2,1)}_{\oplus} : \mathcal{B}^2 \mapsto \mathcal{B}$, $f^{(1,1)}_{\neg} : \mathcal{B} \mapsto \mathcal{B}$ for these functions makes explicit their
number of input and output variables. We generally suppress the second superscript when
functions are Boolean. The physical devices that compute the AND, OR, NOT, and EXCLUSIVE OR
functions are called gates.
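
Because a finite function is completely defined by its table of pairs, the four Boolean functions named above can be tabulated mechanically. The sketch below does this with 0 and 1 standing for False and True; the function names and the helper `truth_table` are assumptions chosen for this illustration.

```python
from itertools import product


def AND(x, y):
    return x & y


def OR(x, y):
    return x | y


def XOR(x, y):
    return x ^ y


def NOT(x):
    return 1 - x


def truth_table(f, n):
    """Tabulate a Boolean function on n inputs as a list of (inputs, value) pairs."""
    return [(bits, f(*bits)) for bits in product((0, 1), repeat=n)]


for name, f, n in [("AND", AND, 2), ("OR", OR, 2), ("XOR", XOR, 2), ("NOT", NOT, 1)]:
    print(name, truth_table(f, n))
```

A function on n binary inputs has a table with $2^n$ rows, which is why tables are practical only for small n.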

Many computational problems are described by functions $f : A^* \mapsto C^*$ from the (unbounded)
set of strings over an alphabet A to the set of strings over a potentially different
alphabet C. Since the letters in every finite alphabet A can be encoded as fixed-length strings
over the binary alphabet $\mathcal{B} = \{0, 1\}$, there is no loss of generality in assuming that functions
are mappings $f : \mathcal{B}^* \mapsto \mathcal{B}^*$, that is, from strings over $\mathcal{B}$ to strings over $\mathcal{B}$.

Functions with unbounded domains can be used to identify languages. A language L over
the alphabet A is uniquely determined by a characteristic function $f : A^* \mapsto \mathcal{B}$ with the
property that $L = \{x \mid x \in A^* \text{ such that } f(x) = 1\}$. This statement means that L is the set
of strings x in $A^*$ on which f, namely $f(x)$, has value 1.

We often restrict a function $f : \mathcal{B}^* \mapsto \mathcal{B}^*$ to input strings of length n, n arbitrary. The
domain of such a function is $\mathcal{B}^n$. Its codomain consists of those strings into which strings of
length n map. This set may contain strings of many lengths. It is often convenient to map
strings of length n to strings of a fixed length containing the same information. This can be
done as follows. Let $h(n)$ be the length of a longest string that is the value of an input string
of length n. Encode letters in $\mathcal{B}$ by repeating them (replace 0 by 00 and 1 by 11) and then add
as a prefix as many instances of 01 as necessary to ensure that each string in the codomain of
$f_n$ has $2h(n)$ characters. For example, if $h(4) = 3$ and $f(0110) = 10$, encode the value 10 as
011100. This encoding provides a function $f_n : \mathcal{B}^n \mapsto \mathcal{B}^{2h(n)}$ containing all the information
that is in the original version of $f_n$.
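
The fixed-length encoding just described can be sketched in a few lines. The function names `encode` and `decode` and the parameter `h` (standing for $h(n)$) are assumptions for this illustration; the scheme itself is the doubling-plus-prefix encoding from the paragraph above.

```python
def encode(value, h):
    """Encode a binary string of length at most h as a string of exactly 2*h characters."""
    if len(value) > h:
        raise ValueError("value is longer than h")
    doubled = "".join(c + c for c in value)      # 0 -> 00, 1 -> 11
    padding = "01" * (h - len(value))            # the pair 01 never occurs in doubled text
    return padding + doubled


def decode(code):
    """Invert encode: drop the 01 padding pairs and halve each doubled letter."""
    pairs = [code[i:i + 2] for i in range(0, len(code), 2)]
    return "".join(p[0] for p in pairs if p != "01")


print(encode("10", 3))      # 011100, the example from the text
print(decode("011100"))     # 10
```

The padding pair 01 is distinguishable from the data pairs 00 and 11, which is what lets `decode` recover values of different lengths from codes of one fixed length.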

It is often useful to work with functions $f : \mathbb{R} \mapsto \mathbb{R}$ whose domains and codomains are
real numbers $\mathbb{R}$. Functions of this type include linear functions, polynomials, exponentials,
and logarithms. A polynomial $p(x) : \mathbb{R} \mapsto \mathbb{R}$ of degree $k - 1$ in the variable x is specified
by a set of k real coefficients, $c_{k-1}, \ldots, c_1, c_0$, where $p(x) = c_{k-1}x^{k-1} + \cdots + c_1 x^1 + c_0$.
A linear function is a polynomial of degree 1. An exponential function is a function of the
form $E(x) = a^x$ for some real a; for example, $2^{1.5} = 2.8284271\ldots$. The logarithm to the
base a of b, denoted $\log_a b$, is the value of x such that $a^x = b$. For example, the logarithm to
base 2 of $2.8284271\ldots$ is 1.5 and the logarithm to base 10 of 100 is 2. A function $f(x)$ is
polylogarithmic if for some polynomial $p(x)$ we can write $f(x)$ as $p(\log_2 x)$; that is, it is a
polynomial in the logarithm of x.

Two other functions used often in this book are the floor and ceiling functions. Their
domains are the reals, but their codomains are the integers. The ceiling function, denoted
$\lceil x \rceil : \mathbb{R} \mapsto \mathbb{Z}$, maps the real x to the smallest integer greater than or equal to it. The floor
function, denoted $\lfloor x \rfloor : \mathbb{R} \mapsto \mathbb{Z}$, maps the real x to the largest integer less than or equal to
it. Thus, $\lceil 3.5 \rceil = 4$ and $\lceil 15.0001 \rceil = 16$. Similarly, $\lfloor 3.5 \rfloor = 3$ and $\lfloor 15.0001 \rfloor = 15$. The
following bounds apply to the floor and ceiling functions.

$$f(x) - 1 < \lfloor f(x) \rfloor \le f(x)$$

$$f(x) \le \lceil f(x) \rceil < f(x) + 1$$

As an example of the application of the ceiling function we note that $\lceil \log_2 n \rceil$ is the number
of bits necessary to give distinct binary codes to n values (for example, the integers 0 through
$n - 1$); writing the integer n itself in binary takes $\lceil \log_2 (n + 1) \rceil$ bits.
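
The floor and ceiling examples and bounds above can be checked directly with Python's standard `math` module. The helper names `bounds_hold` and `ceil_log2` are assumptions for this sketch. The last loop counts bits with integer arithmetic only: `ceil_log2(n)` bits are enough to give distinct codes to n values, while writing the integer n itself in binary takes `n.bit_length()` bits.

```python
import math

# The worked examples from the text.
examples = (math.ceil(3.5), math.ceil(15.0001), math.floor(3.5), math.floor(15.0001))
print(examples)


def bounds_hold(y):
    """Check y - 1 < floor(y) <= y and y <= ceil(y) < y + 1 for a real y."""
    return (y - 1 < math.floor(y) <= y) and (y <= math.ceil(y) < y + 1)


def ceil_log2(n):
    """Exact value of ceil(log2(n)) for an integer n >= 1, without floating point."""
    return (n - 1).bit_length()


for n in (1, 2, 3, 4, 5, 8, 9):
    # Columns: n, bits to distinguish n values, bits to write n in binary.
    print(n, ceil_log2(n), n.bit_length())
```

Using `int.bit_length` avoids the rounding problems that `math.log2` can show near powers of 2 for very large integers.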

1.2.8 Rate of Growth of Functions 

Throughout this book we derive mathematical expressions for quantities such as space, time,
and circuit size. Generally these expressions describe functions $f : \mathbb{N} \mapsto \mathbb{R}$ from the
non-negative integers to the reals, such as the functions $f_1(n)$ and $f_2(n)$ defined as

$$f_1(n) = 4.5n^2 + 3n$$

$$f_2(n) = 3^n + 4.5n^2$$

When n is large we often wish to simplify expressions such as these to make explicit their
dominant or most rapidly growing term. For example, for large values of n the dominant terms
in $f_1(n)$ and $f_2(n)$ are $4.5n^2$ and $3^n$ respectively, as we show. A term dominates when n is
large if the value of the function is approximately the value of this term, that is, if the function
is within some multiplicative factor of the term.

To highlight dominant terms we introduce the big Oh, big Omega, and big Theta notation.
They are defined for functions whose domains and codomains are the integers or the
reals.

DEFINITION 1.2.1 Let $f : \mathbb{R} \mapsto \mathbb{R}$ and $g : \mathbb{R} \mapsto \mathbb{R}$ be two functions whose domains and
codomains are either the integers or the reals. If there are positive constants $x_0$ and $K > 0$ such
that for all $|x| \ge x_0$,

$$|f(x)| \le K|g(x)|$$

we write

$$f(x) = O(g(x))$$

and say that "$f(x)$ is big Oh of $g(x)$" or it grows no more rapidly in x than $g(x)$. Under the
same conditions we also write

$$g(x) = \Omega(f(x))$$

and say that "$g(x)$ is big Omega of $f(x)$" or that it grows at least as rapidly in x as $f(x)$.
If $f(x) = O(g(x))$ and $g(x) = O(f(x))$, we write

$$f(x) = \Theta(g(x)) \text{ or } g(x) = \Theta(f(x))$$

and say that "$f(x)$ is big Theta of $g(x)$" and "$g(x)$ is big Theta of $f(x)$" or that the two
functions have the same rate of growth in x.

The big Oh notation is illustrated by the expressions for $f_1(n)$ and $f_2(n)$ above.

EXAMPLE 1.2.1 We show that $f_1(n) = 4.5n^2 + 3n$ is $O(n^k)$ for any $k \ge 2$; that is, $f_1(n)$
grows no more rapidly than $n^k$ for $k \ge 2$. We also show that $n^k = O(f_1(n))$ for $k \le 2$; that
is, that $n^k$ grows no more rapidly than $f_1(n)$ for $k \le 2$. From the above definitions it follows
that $f_1(n) = \Theta(n^2)$; that is, $f_1(n)$ and $n^2$ have the same rate of growth. We say that $f_1(n)$ is a
quadratic function in n.

To prove the first statement, we need to exhibit a natural number $n_0$ and a constant $K_0 > 0$
such that for all $n \ge n_0$, $f_1(n) \le K_0 n^k$. If we can show that $f_1(n) \le K_0 n^2$, then we have
shown $f_1(n) \le K_0 n^k$ for all $k \ge 2$. To show the former, we must show the following for some
$K_0 > 0$ and for all $n \ge n_0$:

$$4.5n^2 + 3n \le K_0 n^2$$

We try $K_0 = 5.5$ and find that the above inequality is equivalent to $3n \le n^2$ or $3 \le n$. Thus, we
can choose $n_0 = 3$ and we are done.

To prove the second statement, namely, that $n^k = O(f_1(n))$ for $k \le 2$, we must exhibit a
natural number $n_1$ and some $K_1 > 0$ such that for all $n \ge n_1$, $n^k \le K_1 f_1(n)$. If we can show
that $n^2 \le K_1 f_1(n)$, then we have shown $n^k \le K_1 f_1(n)$. To show the former, we must show the
following for some $K_1 > 0$ and for all $n \ge n_1$:

$$n^2 \le K_1 (4.5n^2 + 3n)$$

Clearly, if $K_1 = 1/4.5$ the inequality holds for $n \ge 0$, since $3K_1 n$ is nonnegative. Thus, we choose
$n_1 = 0$ and we are done.

EXAMPLE 1.2.2 We now show that the slightly more complex function $f_2(n) = 3^n + 4.5n^2$
grows as $3^n$; that is, $f_2(n) = \Theta(3^n)$, an exponential function in n. Because $3^n \le f_2(n)$ for
all $n \ge 0$, it follows that $3^n = O(f_2(n))$. To show that $f_2(n) = O(3^n)$, we demonstrate that
$f_2(n) \le 2(3^n)$ holds for $n \ge 4$. This is equivalent to the following inequality:

$$4.5n^2 \le 3^n$$

To prove this holds, we show that $h(n) = 3^n/n^2$ is an increasing function of n for $n \ge 2$
and that $h(4) \ge 4.5$. To show that $h(n)$ is an increasing function of n, we compute the ratio
$r(n) = h(n + 1)/h(n)$ and show that $r(n) \ge 1$ for $n \ge 2$. But $r(n) = 3n^2/(n + 1)^2$ and
$r(n) \ge 1$ when $3n^2 \ge (n + 1)^2$ or when $n(n - 1) \ge 1/2$, which holds for $n \ge 2$. Since
$h(3) = 3$ and $h(4) = 81/16 > 5$, the desired conclusion follows.
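
The constants chosen in Examples 1.2.1 and 1.2.2 can be sanity-checked numerically. The sketch below uses exact rational arithmetic from Python's `fractions` module so that no rounding affects the comparisons. The range bound `N` and the variable names are assumptions, and a finite check like this only illustrates the inequalities; it does not replace the proofs above.

```python
from fractions import Fraction


def f1(n):
    """f1(n) = 4.5 n^2 + 3 n, computed exactly."""
    return Fraction(9, 2) * n**2 + 3 * n


def f2(n):
    """f2(n) = 3^n + 4.5 n^2, computed exactly."""
    return 3**n + Fraction(9, 2) * n**2


N = 200  # only n < N is checked here

# Example 1.2.1: f1(n) <= 5.5 n^2 for n >= 3, and n^2 <= f1(n) / 4.5 for n >= 0.
upper_ok = all(f1(n) <= Fraction(11, 2) * n**2 for n in range(3, N))
upper_fails_at_2 = f1(2) > Fraction(11, 2) * 2**2
lower_ok = all(n**2 <= Fraction(2, 9) * f1(n) for n in range(0, N))

# Example 1.2.2: 3^n <= f2(n) <= 2 * 3^n for n >= 4, but not at n = 3.
theta_ok = all(3**n <= f2(n) <= 2 * 3**n for n in range(4, N))
theta_fails_at_3 = f2(3) > 2 * 3**3

print(upper_ok, upper_fails_at_2, lower_ok, theta_ok, theta_fails_at_3)
```

The two failure checks show that the thresholds $n_0 = 3$ and $n \ge 4$ are the smallest that work for the constants 5.5 and 2.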

1.3 Methods of Proof

In this section we briefly introduce several methods of proof that are used in this book, namely,
proof by induction, proof by contradiction, and the pigeonhole principle. In the previous
section we saw proof by reduction: in each step the condition to be established was translated
into another condition until a condition was found that was shown to be true.

Proofs by induction use predicates, that is, functions of the kind $P : \mathbb{N} \mapsto \mathcal{B}$. The
truth value of the predicate $P : \mathbb{N} \mapsto \mathcal{B}$ on the natural number n, denoted $P(n)$, is 1 or 0
depending on whether the predicate is True or False.

Proofs by induction are used to prove statements of the kind, "For all natural numbers
n, predicate (or property) P is true." Consider the function $S_1 : \mathbb{N} \mapsto \mathbb{N}$ defined by the
following sum:

$$S_1(n) = \sum_{j=1}^{n} j \qquad (1.1)$$

We use induction to prove that $S_1(n) = n(n + 1)/2$ is true for each $n \in \mathbb{N}$.

DEFINITION 1.3.1 A proof by induction has a predicate P, a basis step, an induction
hypothesis, and an inductive step. The basis establishes that $P(k)$ is true for some integer k. The
induction hypothesis assumes that for some fixed but arbitrary natural number $n \ge k$, the
statements $P(k), P(k + 1), \ldots, P(n)$ are true. The inductive step is a proof that $P(n + 1)$ is true
given the induction hypothesis.

It follows from this definition that a proof by induction with the predicate P establishes 
that P is true for all natural numbers larger than or equal to k because the inductive step 
establishes the truth of P(n + 1) for arbitrary integer n greater than or equal to k. Also, 
induction may be used to show that a predicate holds for a subset of the natural numbers. For 
example, the hypothesis that every even natural number is divisible by 2 is one that would be 
defined only on the even numbers. 

The following proof by induction shows that $S_1(n) = n(n + 1)/2$ for $n \ge 0$.

LEMMA 1.3.1 For all $n \ge 0$, $S_1(n) = n(n + 1)/2$.

Proof PREDICATE: The value of the predicate P on n, $P(n)$, is True if $S_1(n) = n(n + 1)/2$
and False otherwise.

BASIS STEP: Clearly, $S_1(0) = 0$ from both the sum and the closed form given above.

INDUCTION HYPOTHESIS: $S_1(k) = k(k + 1)/2$ for $k = 0, 1, 2, \ldots, n$.

INDUCTIVE STEP: By the definition of the sum for $S_1$ given in (1.1), $S_1(n + 1) = S_1(n) +
n + 1$. Thus, it follows that $S_1(n + 1) = n(n + 1)/2 + n + 1$. Factoring out $n + 1$ and
rewriting the expression, we have that $S_1(n + 1) = (n + 1)((n + 1) + 1)/2$, exactly the
desired form. Thus, the statement of the lemma follows for all values of n. ■
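
The structure of the induction can be mirrored in a short numerical check. The sketch below is an illustration only: testing finitely many values is not a proof, and the function names and the range of 1000 values are assumptions.

```python
def S1(n):
    """The sum (1.1): 1 + 2 + ... + n, with the empty sum S1(0) equal to 0."""
    return sum(range(1, n + 1))


def closed_form(n):
    """The closed form n(n + 1)/2 claimed by the lemma."""
    return n * (n + 1) // 2


# Basis step: both sides are 0 at n = 0.
basis_ok = S1(0) == closed_form(0) == 0

# Inductive step, checked for n = 0, ..., 999:
# adding n + 1 to the closed form for n gives the closed form for n + 1.
step_ok = all(closed_form(n) + (n + 1) == closed_form(n + 1) for n in range(1000))

# Direct comparison of the sum with the closed form.
agree = all(S1(n) == closed_form(n) for n in range(1000))

print(basis_ok, step_ok, agree)
```

The integer division in `closed_form` is exact because one of n and n + 1 is always even.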

We now define proof by contradiction. 

DEFINITION 1.3.2 A proof by contradiction has a predicate P. The complement $\neg P$ of P is
shown to be False, which implies that P is True.

The examples shown earlier of strings in the language $L = \{00 \cup 1\}^*$ suggest that L
contains only strings other than $\epsilon$ with an odd number of 1's. Let P be the predicate "L
contains strings other than $\epsilon$ with an even number of 1's." We show that it is true by assuming
it is false, namely, by assuming "L contains only strings with an odd number of 1's" and
showing that this statement is false. In particular, we show that L contains the string 11. From
the definition of the Kleene closure, L contains strings of all lengths in the "letters" 00 and 1.
Thus, it contains a string containing two instances of 1 and the predicate P is true.

Induction and proof by contradiction can also be used to establish the pigeonhole principle. 
The pigeonhole principle states that if there are n pigeonholes, n + 1 or more pigeons, and 
every pigeon occupies a hole, then some hole must have at least two pigeons. We reformulate 
the principle as follows: 

LEMMA 1.3.2 Given two finite sets A and B with $|A| > |B|$, there does not exist a naming
function $\nu : A \mapsto B$ that gives to each element a in A a name $\nu(a)$ in B such that every element
in A has a unique name.

Proof BASIS: $|B| = 1$. To show that the statement is True, assume it is False and show
that a contradiction occurs. If it is False, every element in A can be given a unique name.
However, since there is one name (the one element of B) and more than one element in A,
we have a contradiction.

INDUCTION HYPOTHESIS: There is no naming function $\nu : A \mapsto B$ when $|B| \le n$ and
$|A| > |B|$.

INDUCTIVE STEP: When $|B| = n + 1$ and $|A| > |B|$ we show there is no naming function
$\nu : A \mapsto B$. Consider an element $b \in B$. If two elements of A have the name b, the desired
conclusion holds. If not, remove b from B, giving the set $B'$, and remove from A the
element, if any, whose name is b, giving the set $A'$. Since $|A'| > |B'|$ and $|B'| \le n$, by the
induction hypothesis, there is no naming function obtained by restricting $\nu$ to $A'$. Thus,
the desired conclusion holds. ■
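
For small sets the pigeonhole principle can be confirmed by exhaustive search over every possible naming function. The sketch below is an illustration under assumed names (`has_unique_naming`) and assumed size limits; it enumerates all $|B|^{|A|}$ functions, so it is practical only for very small sets.

```python
from itertools import product


def has_unique_naming(size_a, size_b):
    """True if some function from a size_a-element set A to a size_b-element set B
    gives every element of A a different name."""
    names = range(size_b)
    for naming in product(names, repeat=size_a):   # each tuple is one function A -> B
        if len(set(naming)) == size_a:             # all assigned names are distinct
            return True
    return False


# Whenever |A| > |B|, no naming function with unique names exists.
pigeonhole_ok = all(
    not has_unique_naming(a, b)
    for b in range(1, 5)
    for a in range(b + 1, 6)
)
print(pigeonhole_ok)

# When |A| <= |B| a unique naming does exist, so the condition |A| > |B| matters.
print(has_unique_naming(3, 3))
```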



1.4 Computational Models 



A variety of computer models are examined in this book. In this section we give the reader 
a taste of five models, the logic circuit, the finite-state machine, the random-access machine, 
the pushdown automaton, and the Turing machine. We also briefly survey the problem of 
language recognition. 

1.4.1 Logic Circuits 

A logic gate is a physical device that realizes a Boolean function. A logic circuit, as defined 
in Section 1.2, is a directed acyclic graph in which all vertices except input vertices carry the 
labels of gates. 

Logic gates can be constructed in many different technologies. To make ideas concrete,
Fig. 1.3(a) and (b) show electrical circuits for the AND and OR gates constructed with batteries,
bulbs, and switches. Shown with each of these circuits is a logic symbol for the gate. These
symbols are used to draw circuits, such as the circuit of Fig. 1.3(c) for the function $(x \vee y) \wedge z$.
When electrical current flows out of the batteries through a switch or switches in these circuits,
the bulbs are lit. In this case we say the value of the circuit is True; otherwise it is False. Shown
below is the truth table for the function mapping the values of the three input variables of the
circuit in Fig. 1.3(c) to the value of the one output variable. Here x, y, and z have value 1
when the switch that carries its name is closed; otherwise they have value 0.

    x  y  z | (x ∨ y) ∧ z
    0  0  0 |      0
    0  0  1 |      0
    0  1  0 |      0
    0  1  1 |      1
    1  0  0 |      0
    1  0  1 |      1
    1  1  0 |      0
    1  1  1 |      1

Figure 1.3 Three electrical circuits simulating logic circuits.

Today's computers use transistor circuits instead of the electrical circuits of Fig. 1.3.

Logic circuits execute straight-line programs, programs containing only assignment
statements. Thus, they have no loops or branches. (They may have loops if the number of times
a loop is executed is fixed.) This point is illustrated by the "full-adder" circuit of Fig. 1.4,
a circuit discussed at length in Section 2.7. Each external input and each gate is assigned a
unique integer. Each is also assigned a variable whose value is the value of the external input
or gate. The ith vertex is assigned the variable $x_i$. If $x_i$ is associated with a gate that combines
the results produced at the jth and kth gates with the operator $\odot$, we write an assignment
operation of the form $x_i := x_j \odot x_k$. The sequence of assignment operations for a circuit is
a straight-line program. Below is a straight-line program for the circuit of Fig. 1.4:

$$\begin{aligned}
x_4 &:= x_1 \oplus x_2 \\
x_5 &:= x_4 \wedge x_3 \\
x_6 &:= x_1 \wedge x_2 \\
x_7 &:= x_4 \oplus x_3 \\
x_8 &:= x_5 \vee x_6
\end{aligned}$$

Figure 1.4 A full-adder circuit. Its output pair $(x_8, x_7)$ is the standard binary representation
for the number of 1s among its three inputs $x_1$, $x_2$, and $x_3$.

The values computed for $(x_8, x_7)$ are the standard binary representation for the number of 1s
among $x_1$, $x_2$, and $x_3$. This can be seen by constructing a table of values for $x_1$, $x_2$, $x_3$, $x_7$,
and $x_8$. Full-adder circuits can be combined to construct an adder for binary numbers. (In
Section 2.2 we give another notation for straight-line programs.)
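
A straight-line program translates directly into a sequence of assignments with no loops or branches. The sketch below assumes the standard full-adder reading of the five gate assignments (two EXCLUSIVE ORs, two ANDs, and one OR) and prints the table of values mentioned above; the function name `full_adder` is an assumption for this illustration.

```python
from itertools import product


def full_adder(x1, x2, x3):
    """Straight-line program for a full adder; returns the output pair (x8, x7)."""
    x4 = x1 ^ x2      # x4 := x1 XOR x2
    x5 = x4 & x3      # x5 := x4 AND x3
    x6 = x1 & x2      # x6 := x1 AND x2
    x7 = x4 ^ x3      # x7 := x4 XOR x3  (low-order bit of the count)
    x8 = x5 | x6      # x8 := x5 OR x6   (high-order bit of the count)
    return x8, x7


# Table of values for x1, x2, x3, x8, x7.
for x1, x2, x3 in product((0, 1), repeat=3):
    x8, x7 = full_adder(x1, x2, x3)
    print(x1, x2, x3, "->", x8, x7)
```

Reading $(x_8, x_7)$ as a two-bit binary number, each row gives the number of 1s among the three inputs.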

As shown in the truth table for Fig. 1.3(c), each logic circuit has associated with it a binary
function that maps the values of its input variables to the values of its output variables. In the
case of the full-adder, since $x_8$ and $x_7$ are its output variables, we associate with it the function
$f^{(3,2)}_{\mathrm{FA}} : \mathcal{B}^3 \mapsto \mathcal{B}^2$, whose value is $f^{(3,2)}_{\mathrm{FA}}(x_1, x_2, x_3) = (x_8, x_7)$.



Algebraic circuits are similar to logic circuits except they may use operations over non-binary
sets, such as addition and multiplication over a ring, a concept explained in Section 6.2.1.
Algebraic circuits are the subject of Chapter 6. They are also described by DAGs
and they execute straight-line programs where the operators are non-binary functions.
Algebraic circuits also have associated with them functions that map the values of inputs to the
values of outputs.

Logic circuits are the basic building blocks of all digital computers today. When such 
circuits are combined with binary memory cells, machines with memory can be constructed. 
The models for these machines are called finite-state machines. 



1.4.2 Finite-State Machines 

The finite-state machine (FSM) is a machine with memory. It executes a series of steps during
each of which it takes its current state from the set Q of states and current external input from
the set $\Sigma$ of input letters and combines them in a logic circuit L to produce a successor state
in Q and an output letter in $\Psi$, as suggested in Fig. 1.5. The logic circuit L can be viewed as
having two parts, one that computes the next-state function $\delta : Q \times \Sigma \mapsto Q$, whose value
is the next state of the FSM, and the other that computes the output function $\lambda : Q \mapsto \Psi$,
whose value is the output of the FSM in the current state. A generic finite-state machine is
shown in Fig. 1.5(a) along with a concrete FSM in Fig. 1.5(b) that provides as successor state
and output the EXCLUSIVE OR of the current state and the external input. The state diagram
of the FSM in Fig. 1.5(b) is shown in Fig. 1.8. Two (or more) finite-state machines that operate
in lockstep can be interconnected to form a single FSM. In this case, some outputs of one FSM
serve as inputs to the other.

Figure 1.5 (a) The finite-state machine (FSM) model; at each unit of time its logic unit, L,
operates on its current state (taken from its memory) and its current external input to compute an
external output and a new state that it stores in its memory. (b) An FSM that holds in its memory
a bit that is the EXCLUSIVE OR of the initial value stored in its memory and the external inputs
received to the present time.
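
The FSM model can be sketched as a loop that repeatedly applies a next-state function and an output function. The code below instantiates it for the EXCLUSIVE OR machine described above, with states and inputs drawn from {0, 1}. The names `delta`, `lam`, and `run_fsm` are assumptions, as is the convention of recording the output after each state update.

```python
def delta(q, s):
    """Next-state function: the EXCLUSIVE OR of the current state and the input."""
    return q ^ s


def lam(q):
    """Output function: the output letter is the value of the current state."""
    return q


def run_fsm(initial_state, inputs):
    """Process the inputs one per step; return the final state and the list of outputs."""
    q = initial_state
    outputs = []
    for s in inputs:
        q = delta(q, s)          # the memory is updated with the successor state
        outputs.append(lam(q))   # assumed convention: output is read after the update
    return q, outputs


final_state, outputs = run_fsm(0, [1, 0, 1, 1])
print(final_state, outputs)
```

Started in state 0, the machine's state after each step is the EXCLUSIVE OR of all inputs received so far, that is, the parity of the number of 1s.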

Finite-state machines are ubiquitous today. They are found in microwave ovens, VCRs and 
automobiles. They can be simple or complex. One of the most useful FSMs is the general- 
purpose computer modeled by the random-access machine. 

1.4.3 Random-Access Machine 

The (bounded-memory) random-access machine (RAM) is modeled as a pair of interconnected
finite-state machines, one a central processing unit (CPU) and the other a random-access
memory, as suggested in Fig. 1.6. The random-access memory holds m b-bit words,
each identified by an address. It also holds an output word (out_wrd) and a triple of inputs
consisting of a command (cmd), an address (addr), and an input data word (in_wrd). cmd
is either READ, WRITE, or NO-OP. A NO-OP command does nothing whereas a READ command
changes the value of out_wrd to the value of the data word at address addr. A WRITE
command replaces the data word at address addr with the value of in_wrd.

Figure 1.6 The bounded-memory random-access machine.
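
The behavior of the random-access memory under its three commands can be sketched as a small class. The class name `RandomAccessMemory`, the `step` method, the initial contents of 0, and the truncation of written values to b bits are all assumptions made for this illustration; only the meanings of READ, WRITE, and NO-OP come from the description above.

```python
class RandomAccessMemory:
    """A memory of m words of b bits each, driven by READ, WRITE, and NO-OP commands."""

    def __init__(self, m, b):
        self.b = b
        self.words = [0] * m     # assumed initial contents
        self.out_wrd = 0

    def step(self, cmd, addr=0, in_wrd=0):
        """Apply one command and return the current value of out_wrd."""
        if cmd == "READ":
            self.out_wrd = self.words[addr]
        elif cmd == "WRITE":
            self.words[addr] = in_wrd % (1 << self.b)   # keep only b bits (assumption)
        elif cmd != "NO-OP":
            raise ValueError(f"unknown command: {cmd}")
        return self.out_wrd


mem = RandomAccessMemory(m=4, b=8)
mem.step("WRITE", addr=2, in_wrd=0b10110001)
print(mem.step("READ", addr=2))   # the word just written
print(mem.step("NO-OP"))          # out_wrd is unchanged
```

A WRITE leaves out_wrd untouched in this sketch; only a READ changes it, matching the description of the commands.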

The random-access memory holds data as well as programs, collections of instructions 
for the CPU. The CPU executes the fetch-and-execute cycle in which it repeatedly reads an 
instruction from the random-access memory and executes it. Its instructions typically include 
arithmetic, logic, comparison, and jump instructions. Comparisons are used to decide whether 
the CPU reads the next program instruction in sequence or jumps to an instruction out of 
sequence. 

The general-purpose computer is much more complex than suggested by the above brief
sketch of the RAM. It uses a rich variety of methods to achieve high speed at low cost with the
available technology. For example, as the number of components that can fit on a semiconductor
chip increases, designers have begun to use "super-scalar" CPUs, CPUs that issue multiple
instructions in each time step. Also, memory hierarchies are becoming more prevalent as
designers assemble collections of slower but larger memories with lower costs per bit to simulate
expensive fast memories.

1.4.4 Other Models 

There are many other models of computers with memory, some of which have an infinite 
supply of data words, such as the Turing machine, a machine consisting of a control unit (an 
FSM) and a tape unit that has a potentially infinite linear array of cells each containing letters 
from an alphabet that can be read and written by a tape head directed by the control unit. It 
is assumed that in each time step the head may move only from one cell to an adjacent one on 
the linear array. (See Fig. 1.7.) The Turing machine is a standard model of computation since 
no other machine model has been discovered that performs tasks it cannot perform. 

The pushdown automaton is a restricted form of Turing machine in which the tape is
used as a pushdown stack. Data is entered, deleted, and accessed only at the top of a stack. A
pushdown stack can be simulated by a tape in which the cell to the right of the tape head is
always blank. If the head moves right from a cell, it writes a non-blank symbol in the cell. If it
moves left, it writes a blank in that cell before leaving it.

Figure 1.7 The Turing machine has a control unit that is a finite-state machine and a tape unit
that controls reading and writing by a tape head and the movement of the tape head one cell at a
time to the left or right of the current position.

Some computers are serial: they execute one operation on a fixed amount of data per time 
step. Others are parallel; that is, they have multiple (usually communicating) subcomputers 
that operate simultaneously. They may operate synchronously or asynchronously and they may 
be connected via a simple or a complex network. An example of a simple network is a wire 
between two computers. An example of a complex network is a crossbar switch consisting of 
25 switches at the intersection of five columns and five rows of wires; closing the switch at the 
intersection of a row and a column connects the two wires and the two computers to which 
they are attached. 

We close this section by emphasizing the importance of models of computers. Good models
provide a level of abstraction at which important facts and insights can be developed without
losing so much detail that the results are irrelevant to practice.

1.4.5 Formal Languages 

In Chapters 4 and 5 the finite-state machine, pushdown automaton, and Turing machine are
characterized by their language recognition capability. Formal methods for specifying languages
have led to efficient ways to parse and recognize programming languages. This is
illustrated by the finite-state machine of Fig. 1.8. Its initial state is $q_0$, its final state is $q_1$, and
its inputs can assume values 0 or 1. An output of 0 is produced when the machine is in state
$q_0$ and an output of 1 is produced when it is in state $q_1$. The output before the first input is
received is 0.

After the first input the output of the FSM of Fig. 1.8 is equal to the input. After multiple
inputs the output is the EXCLUSIVE OR of the 1's and 0's among the inputs, as we show by
induction. The inductive hypothesis is clearly true after one input. Suppose it is true after k
inputs; we show that it remains true after $k + 1$ inputs, and therefore for all inputs. The output
uniquely determines the state. There are two cases to consider: after k inputs either the FSM is
in state $q_0$ or it is in state $q_1$. For each state, there are two cases to consider based on the value
of the $(k + 1)$st input. In all four cases it is easy to see that after the $(k + 1)$st input the output is
the EXCLUSIVE OR of the first $k + 1$ inputs.




Figure 1.8 A state diagram for a finite-state machine whose circuit model is given in Fig. 1.5(b).
$q_0$ is the initial state of the machine and $q_1$ is its final state. If the machine is in $q_0$, it has received
an even number of 1 inputs, whereas if it is in $q_1$, it has received an odd number of 1's.

The language recognized by an FSM is defined in two ways. It is the set of input strings
that cause the FSM to produce a particular letter as its last output or to enter one of the set
of final states on its last input. Thus, the FSM of Fig. 1.8 recognizes the set of binary strings
containing an odd number of 1's. It also recognizes the set of binary strings containing an even
number of 1's because they result in a last output of 0.

An FSM can also compute a function. The most general function that it computes in
T steps is the function $f_M^{(T)} : Q \times \Sigma^T \mapsto Q \times \Psi^T$ that maps the initial state s and the
T inputs $w_1, w_2, \ldots, w_T$ to the T outputs $y_1, y_2, \ldots, y_T$. It can also compute any other
function obtained by ignoring some outputs or fixing either the initial state or some inputs
or both.

The class of languages recognized by finite-state machines (the regular languages) is not
rich enough to describe easily the important programming languages that are in use today. As
a consequence, other languages, such as the context-free languages, are employed. Context-free
languages (which include the regular languages) require computers with potentially unbounded
storage for their recognition. The class of computers that recognizes exactly the
context-free languages is the class of nondeterministic pushdown automata, pushdown automata in
which the control unit is nondeterministic; that is, some of its states can have multiple potential
successor states.

The strings in regular and context-free languages (and other languages as well) can be 
generated by grammars. A context-free grammar $G = (\mathcal{N}, \mathcal{T}, \mathcal{R}, S)$ consists of sets of terminal 
and non-terminal symbols, $\mathcal{T}$ and $\mathcal{N}$ respectively, and rules $\mathcal{R}$ by which each non-terminal 
is replaced with one or more strings of terminals and non-terminals. All string generations 
start with the special start non-terminal $S$. The language generated by G, L(G), contains the 
strings of terminal characters produced by rewriting strings in this fashion. This is illustrated 
by the context-free grammar G with two rules shown below. 

EXAMPLE 1.4.1 $G = (\mathcal{N}, \mathcal{T}, \mathcal{R}, S)$, where $\mathcal{N} = \{S\}$, $\mathcal{T} = \{a, b\}$, and $\mathcal{R}$ consists of the two 
rules 

(a) $S \to aSb$    (b) $S \to ab$ 

Each application of a rule derives another string, as shown below. This grammar has only 
two derivations, namely $S \to aSb$ and $S \to ab$. The second derivation is always the last to be 
used. (Recall that the language L(G) contains only terminal strings.) 

$S \to aSb$ 
$\to aaSbb$ 
$\to aaaSbbb$ 
$\to aaaabbbb$ 

As can be seen by inspection, the only strings in L(G) are of the form $a^k b^k$, where $a^k$ denotes 
the letter a repeated k times. Thus, $L(G) = \{a^k b^k \mid k \geq 1\}$. 
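
The derivation process can be sketched in a few lines of code. This is an illustrative sketch only: the function name `derive` and the use of plain strings with the symbol `S` as the non-terminal are assumptions made for the example.

```python
def derive(k):
    """Return the derivation steps that produce a^k b^k, for k >= 1."""
    if k < 1:
        raise ValueError("k must be at least 1")
    current = "S"
    steps = [current]
    for _ in range(k - 1):
        current = current.replace("S", "aSb")  # rule (a): S -> aSb
        steps.append(current)
    current = current.replace("S", "ab")       # rule (b): S -> ab, always last
    steps.append(current)
    return steps
```

For instance, `derive(4)` reproduces the derivation shown in the example: rule (a) is applied three times and rule (b) once.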

Once a grammar for a regular or context-free language is known, it is possible to parse a 
string in the language. In the above example this amounts to determining the number of times 
that the first rule is applied. 

To develop some intuition for the use of the pushdown automaton as a recognizer for 
context-free languages, observe that we can determine the number of applications of the first 
rule in this language by pushing each instance of a onto a stack and then popping a's as b's are 
encountered. The number of a's can then be matched with the number of b's and if they are 
not equal, the string is declared not in the language. If equal, the number of instances of the 
first rule is determined. 
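
The stack discipline just described can be sketched as follows. This is a hedged illustration rather than a formal pushdown automaton: the function name `in_language` and the convention of returning the rule count alongside the verdict are assumptions for the example.

```python
def in_language(s):
    """Decide membership in {a^k b^k | k >= 1} using an explicit stack.

    Returns (accepted, uses_of_first_rule); the count is None on rejection.
    """
    stack = []
    i = 0
    # Push every leading a onto the stack.
    while i < len(s) and s[i] == "a":
        stack.append("a")
        i += 1
    pushed = len(stack)
    # Pop one a for each b encountered.
    while i < len(s) and s[i] == "b" and stack:
        stack.pop()
        i += 1
    # Accept only if all input was consumed and the counts matched.
    accepted = pushed >= 1 and i == len(s) and not stack
    return accepted, (pushed - 1 if accepted else None)
```

A string $a^k b^k$ is derived with k - 1 applications of the first rule followed by one application of the second, which is why the sketch reports `pushed - 1`.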

Programming languages contain strings of characters and digits representing names and 
the values of variables. Such strings can typically be scanned with finite-state machines. Once 
scanned, these strings can be assigned tokens that are then used in a later parsing phase, which 
today is typically based on a generalization of parsing for context-free languages. 



1.5 Computational Complexity 



Computational complexity is examined in concrete and abstract terms. The concrete analysis 
of computational limits is done using models that capture the exchange of space for time. It also 
is done via the study of circuit complexity, the minimal size and depth of circuits for functions. 
Computational complexity is studied abstractly via complexity classes, the classification of 
languages by the time and/or space they need. 

1.5.1 A Computational Inequality 

Computational inequalities play an important role in this book. We now sketch the derivation 
of a computational inequality for the finite-state machine and specialize it to the RAM. The 
idea is very simple: we simulate with a circuit the computation of a function $f$ by an FSM 
and then compare the size of the circuit produced with the size of the smallest circuit for $f$. 
Simulation, which we use to derive this result, is a central idea in theoretical computer science. 
For example, it is used to show that a problem is NP-complete. We use it here to relate the 
resources available to compute a function $f$ with an FSM to the inherent complexity of $f$. 

Shown in Fig. 1.5(a) is the standard model for an FSM. As suggested, a circuit L combines 
the current state held in the memory M together with an external input to form an external 
output and a successor state which is held in M. If the input, output, and state are represented 
as binary tuples, the circuit L can be realized by a logic circuit with Boolean gates. Let the 
FSM compute the function $f : \mathcal{B}^n \mapsto \mathcal{B}^m$ in T steps; that is, its state and/or T external 
inputs contain the n Boolean inputs to $f$ and its T outputs contain the m Boolean outputs of 
$f$. (The inputs and outputs must appear in the same positions on each computation to prevent 
the application of hidden computational resources.) 

Figure 1.9 A circuit that computes the same function as an FSM (see Fig. 1.5(a)) in T steps. It 
has the same initial state s, receives the same inputs and produces the same outputs. 

The function $f$ can also be computed by the circuit shown in Fig. 1.9, which is obtained 
by unwinding the loop of Fig. 1.5(a) using T copies of the logic circuit L for the FSM. This 
follows because the inputs $x_1, x_2, \ldots, x_T$ that would be given to the FSM over time can be 
given simultaneously to this circuit and it will produce the T outputs that would be produced 
by the FSM. This circuit has $T \cdot C(L)$ gates, where $C(L)$ is the actual or equivalent number of 
gates used to realize L. (The circuit L may be realized with a technology that does not formally 
use gates.) Since this circuit is not necessarily the smallest circuit for the function $f$, we have 
the following inequality, where $C(f)$ is the size of the smallest circuit for $f$: 

$$C(f) \leq T \cdot C(L)$$

This result is important because it imposes a constraint on every computation done by a 
sequential machine. This inequality has two interpretations. First, if the product $T \cdot C(L)$ 
(the equivalent number of logic operations employed) of the number of time steps T and 
the equivalent number of logic operations $C(L)$ per step is too small, namely, less than $C(f)$, 
the FSM cannot compute function $f$ because the above inequality would be violated. This is 
a form of impossibility theorem for bounded computations. Second, a complex function, 
one for which $C(f)$ is large, requires a large value for the product $T \cdot C(L)$. In light of the first 
interpretation of $T \cdot C(L)$ as the equivalent number of logic operations employed, it makes 
sense to call $W = T \cdot C(L)$ the computational work done by the FSM to compute $f$. 
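
The first interpretation amounts to a simple feasibility test. The sketch below is illustrative only: the function names are hypothetical, and the numbers in the usage note are made up for the example. Passing the test is a necessary condition, not a proof that the FSM computes the function.

```python
def computational_work(steps, gates_per_step):
    """W = T * C(L): the equivalent number of logic operations employed."""
    return steps * gates_per_step


def can_possibly_compute(steps, gates_per_step, circuit_size_lower_bound):
    """Check the necessary condition C(f) <= T * C(L).

    circuit_size_lower_bound is any known lower bound on C(f).
    A False result means the computation is impossible; a True result
    only means that this inequality does not rule it out.
    """
    return circuit_size_lower_bound <= computational_work(steps, gates_per_step)
```

With hypothetical values T = 10 and C(L) = 5, the work is 50, so a function known to need at least 60 gates cannot be computed by that machine in that time, whereas a function needing at least 50 gates is not excluded.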

The above computational inequality can be specialized to the bounded-memory RAM with 
S bits of memory. When S is large, as it usually is, C(L) for the RAM is proportional to S. As 
a consequence, for the RAM we have the following computational inequality for some positive 
constant k: 

$$C(f) \leq kST$$

This inequality shows the central role of circuit complexity in theoretical computer science. It 
also demonstrates that the space-time product, ST, is an important measure of the complexity 
of a problem. Functions with large circuit size can be computed by a RAM only if it either has 
a large storage capacity or executes many time steps or both. Similar results exist for the Turing 
machine. 

1.5.2 Tradeoffs in Space, Time, and I/O Operations 

Computational inequalities of the kind sketched above are important but often difficult to 
apply because it is hard to show that functions have a large circuit size. For this reason space- 
time tradeoffs have been studied under the assumption that the type of algorithm or program 
allowed is restricted. For example, if only straight-line programs are considered, then the pebble 
game sketched below and discussed in detail in Chapter 10 can be used to derive tradeoff 
inequalities. 

The standard pebble game is played on a directed acyclic graph (DAG), the graph of a 
straight-line program. The input vertices of a DAG have no edges directed into them. Output 
vertices have no edges directed away from them. Internal vertices are non-input vertices. A 
predecessor of a vertex v is a vertex u that has an edge directed to v. The pebble game is played 
with pebbles that are placed on vertices according to the following rules: 

• Initially no vertices carry pebbles. 

• A pebble can be placed on an input vertex at any time. 
• A pebble can be placed on an internal vertex only if all of its predecessor vertices carry 
pebbles. 

• The pebble moved to a vertex can be a pebble residing on one of its immediate predecessors. 

• A pebble can be removed from a vertex at any time. 

• Every output vertex must carry a pebble at some time. 

Space S in this game is the maximum number of pebbles used to play the game on a 
DAG. Time T is the number of times that pebbles are placed on vertices. If enough pebbles 
are available to play the game, each vertex is pebbled once and T is the number of vertices in 
the graph. If, however, there are not enough pebbles, some vertices will have to be pebbled 
more than once. In this case a tradeoff between space and time will be exhibited. 
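
The rules and the two measures can be checked mechanically. The following is a minimal sketch, assuming a DAG given as a dictionary of predecessor lists and a strategy given as a list of moves; the move encoding (`"place"`, `"slide"`, `"remove"`) and the function name `play_pebble_game` are hypothetical conventions for this example.

```python
def play_pebble_game(preds, outputs, moves):
    """Validate a pebbling strategy and return (S, T).

    preds   -- dict mapping each vertex to its list of predecessors
               (input vertices map to an empty list)
    outputs -- iterable of output vertices
    moves   -- list of ("place", v), ("slide", u, v) or ("remove", v)
    """
    pebbled = set()           # initially no vertices carry pebbles
    outputs_done = set()
    space = 0                 # S: maximum number of pebbles on the graph
    time = 0                  # T: number of pebble placements
    for move in moves:
        if move[0] == "remove":
            pebbled.discard(move[1])   # removal is allowed at any time
            continue
        v = move[-1]
        # Inputs have no predecessors, so this test always passes for them.
        if not all(p in pebbled for p in preds[v]):
            raise ValueError(f"predecessors of {v} are not all pebbled")
        if move[0] == "slide":
            u = move[1]
            if u not in preds[v]:
                raise ValueError(f"{u} is not a predecessor of {v}")
            pebbled.remove(u)          # reuse the pebble residing on u
        pebbled.add(v)
        time += 1
        space = max(space, len(pebbled))
        if v in outputs:
            outputs_done.add(v)
    if outputs_done != set(outputs):
        raise ValueError("some output vertex was never pebbled")
    return space, time
```

On a three-vertex graph with inputs x and y feeding an output g, placing pebbles on x, y, and g uses S = 3 and T = 3, whereas sliding the pebble from x to g achieves S = 2 with the same T.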

For a particular DAG G we may seek to determine the minimum number of pebbles, $S_{\min}$, 
needed to place pebbles on all output vertices at some time and for a given number of pebbles S 
to determine the minimum time T needed when S pebbles are used. Methods for computing 
$S_{\min}$ and bounding S and T simultaneously have been developed. For example, the 
four-point (four-input) fast Fourier transform (FFT) graph shown in Fig. 1.10 has $S_{\min} = 3$ and 
can be pebbled in the minimum number of steps with five pebbles. 

Let the FFT graph of Fig. 1.10 be pebbled with the minimum number S of pebbles. 
Initially no pebbles reside on the graph. Thus, there is a first point in time at which S pebbles 
reside on the graph. The dark gray vertices identify one possible placement of pebbles at such 
a point in time. The light gray vertices will have had pebbles placed on them prior to this time 
and will have to be repebbled again later to pebble output vertices that cannot be reached from 
the placement of the dark gray vertices. This demonstrates that for this graph if the minimum 
number of pebbles is used, some vertices will have to be repebbled. Although the n-point 
FFT graph, n a power of two, has only $n \log n + n$ vertices, we show in Section 10.5.5 that its 
vertices must be repebbled enough times that S and T satisfy $(S + 1)T \geq n^2/16$. Thus, either 
S is much larger than the minimum space or T is much larger than the number of vertices 
or both. 
Figure 1.10 A pebbling of a four-input FFT graph at the point at which the maximum number 
of pebbles (three) is used. Numbers specify the order in which vertices can be pebbled. A 
maximum of three pebbles is used. Some vertices are pebbled twice. 
Space-time tradeoffs can also be studied with the branching program, a type of program 
that permits data-dependent computations. (See Section 10.9.) While branching programs 
provide more flexibility than does the pebble game, they are worth considering only for prob- 
lems in which the algorithms used involve branching and have access to an external random- 
access memory to permit data-dependent reading of inputs, a strong assumption. For many 
problems only straight-line programs are used, in which case the pebble game is the model of 
choice. 

A serious problem arises when the storage capacity of a primary memory is too small for 
a problem, so that a slow secondary memory, such as a disk, must be used for intermediate 
storage. This results in time-consuming input/output operations (I/O) between primary and 
secondary memory. If too many I/O operations are done, the overall performance of the system 
can deteriorate markedly. This problem has been exacerbated by the growing disparity between 
the speed of CPUs and that of memories; the speed of CPUs is increasing over time at a greater 
rate than that of memories. In fact, the latency of a disk, the time between the issuance of a 
request for data and the time it is answered, can be 100,000 to 1,000,000 times the length of a 
CPU cycle. As a consequence, the amount of time spent swapping data between primary and 
secondary memory may dominate the time to perform computations. A second pebble game, 
the red-blue pebble game, has been introduced to study this problem. (See Chapter 11.) 

The red-blue pebble game is played with both red and blue pebbles. The (hot) red pebbles 
correspond to primary memory locations and the (cool) blue pebbles correspond to secondary 
memory locations. Red pebbles are played according to the rules of the above pebble game. 
The additional rules that apply to the red and blue pebbles allow a red pebble to be swapped 
for a blue one and vice versa. In addition, blue pebbles reside only on inputs initially and 
must reside on outputs finally. The number of red pebbles is limited, but the number of blue 
pebbles is not. 

The goal of the red-blue pebble game is to minimize the number of times that red and 
blue pebbles are swapped, since each swap corresponds to an expensive input/output (I/O) 
operation. Let T be the number of I/O operations and S be the number of red pebbles. 
Upper and lower bounds on the exchange of S for T have been derived for a large number of 
problems. For example, for the problem of multiplying two $n \times n$ matrices in about $2n^3$ steps 
with the classical algorithm, it has been shown that a red-blue pebble-game strategy leads to a 
product $ST^2$ proportional to $n^6$ and that this cannot be beaten except by a small multiplicative 
factor. 

1.5.3 Complexity Classes 

Complexity classes provide a way to group languages of similar computational complexity. For 
example, the nondeterministic polynomial-time languages (NP) are languages that can be 
solved in time that is polynomial in the size of their input when the machine in question is 
a nondeterministic Turing machine (TM). Nondeterministic Turing machines can have more 
than one state that is a successor to the current state for the current input. Thus, they can 
make choices between successor states. A language L is in NP if there is a nondeterministic 
TM such that, given an arbitrary string in L, there is some choice of successor states for the 
TM control unit that causes the TM to enter an accepting state in a number of steps that is 
polynomial in the length of the input. 

An NP-complete language $L_0$ must satisfy two conditions. First, $L_0$ must be in NP and 
second, it must be true that for each language L in NP a string x in L can be translated 
into a string y of $L_0$ using an algorithm whose running time is a polynomial in the length 
of x such that y is in $L_0$ if and only if x is in L. As a consequence of this definition, if any 
NP-complete language can be solved in deterministic polynomial time, then every language in 
NP can, including all the other NP-complete languages. However, the best algorithms known 
today for NP-complete languages all have exponential running time. Thus, for long strings 
these algorithms are impractical. If solutions to large NP-complete languages are needed, we 
are limited to approximate solutions. 

1.5.4 Circuit Complexity 

Circuit complexity is a notoriously difficult subject. Despite decades of research, we have 
failed to find methods to show that individual functions have super-polynomial circuit size 
or more than poly-logarithmic depth. Nonetheless, the circuit is such a simple and appealing 
model that it continues to attract a considerable amount of attention. Some very interesting 
exponential lower bounds on circuit size have been derived when the circuits are monotone, 
that is, realized by AND and OR gates but no NOTs. 



1.6 Parallel Computation 



The VLSI machine and the PRAM are examples of parallel machines. The VLSI machine 
reflects constraints that exist when finite-state machines are realized through the very large- 
scale integration of components on semiconductor chips. In the VLSI model the area of a chip 
is important because large chips have a much higher probability of containing a disabling defect 
than smaller ones. Consequently, the absolute size of chips is limited. However, the width of 
lines that can be drawn on chips has been shrinking over time, thereby increasing the number 
of wires, gates, and binary memory cells that can be placed on them. This has the effect of 
increasing the effective chip area, the real chip area normalized by the cross section of wires. 

Figure 1.11(a) is a VLSI diagram representing the types of material that can be deposited on 
the surface of a pure crystalline semiconductor substrate to form different types of conducting 
regions. Some of the rectangular regions serve as wires whereas overlaps of other regions create 
transistors. In turn, collections of transistors form gates. This VLSI diagram describes a NAND 
gate, a gate whose Boolean function is the NOT of the AND of its two inputs. Shown in 
Fig. 1.11(b) is the logic symbol for the NAND gate. The small circle at the output of the AND 
gate denotes the NOT of the gate value. 

Given the premium attached to chip real estate, a large number of economical and very 
regular finite-state machine designs have been made for VLSI chips. One of the most im- 
portant of these is the systolic array, a one- or two-dimensional array of processors (FSMs) 
that are identical, except possibly for those along the periphery of the array. These processors 
operate in synchrony; that is, they perform the same operation at the same time. They also 
communicate only with their nearest neighbors. (The word "systolic" is derived from "systole," 
a "rhythmically recurrent contraction" such as that of the heart.) 

Systolic arrays are typically used to compute specific functions such as the convolution 
$c = a \otimes b$ of the n-tuple $a = (a_0, a_1, \ldots, a_{n-1})$ with the m-tuple $b = (b_0, b_1, \ldots, b_{m-1})$. 
The jth component, $c_j$, of the convolution $c = a \otimes b$, $0 \leq j \leq n + m - 2$, is defined as 

$$c_j = \sum_{r+s=j} a_r * b_s$$

It is assumed that the components of a and b are drawn from a set over which the operations 
of $*$ (multiplication) and $\sum$ (addition) are defined, such as the integers. 

Figure 1.11 (a) A layout diagram for a VLSI chip and (b) its logic symbol. 

Figure 1.12 A systolic array for the convolution of two binary sequences. 

Shown schematically in Fig. 1.12 on page 28 is the one-dimensional systolic array for the 
convolution $c = a \otimes b$ at the second, fourth, fifth, and sixth steps of execution on input 
vectors $a = (a_0, a_1, a_2)$ and $b = (b_0, b_1, b_2)$. The components of these vectors are fed from 
the left and right, respectively, spaced by zero elements. The first component of a enters the 
array one step ahead of the first component of b. The result of the convolution is the vector 
$c = (c_0, c_1, c_2, c_3, c_4)$. There is one more cell in the array than there are components in the 
result. At each step the components of a and b in each cell are multiplied and added to the 
previous value of the component of c in that cell. After all components of the two input vectors 
pass through the cell, the convolution is computed. 
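
As a reference point for the function the array computes, the convolution can be evaluated directly from its definition, in which component j sums the products of all pairs of entries whose indices add up to j. This sketch does not model the systolic data flow (the timing, the zero spacing, or the cells); it is a sequential computation over integers, and the name `convolve` is an illustrative choice.

```python
def convolve(a, b):
    """Return the convolution c of sequences a and b.

    c has len(a) + len(b) - 1 components, and c[j] is the sum of
    a[r] * b[s] over all index pairs with r + s == j.
    """
    n, m = len(a), len(b)
    c = [0] * (n + m - 1)
    for r in range(n):
        for s in range(m):
            c[r + s] += a[r] * b[s]   # multiply and accumulate into component r + s
    return c
```

For two input vectors of length three the result has five components, matching the description of the array above.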

The processors of a parallel computer generally do not communicate only with nearest 
neighbors, as in the systolic array. Instead, processors often can communicate with remote 
neighbors via a network. The type of network chosen for a parallel computer can have a large 
impact on its effectiveness. 

The processors of the PRAM mentioned in Section 1.1 operate synchronously, alternating 
between accessing a global memory and computing locally. Since the processors communicate 
by writing and reading values to and from the global memory, all processors are at the same 
distance from one another. Although the PRAM model makes two unrealistic assumptions, 
namely that processors a) can act in synchrony and b) can communicate directly via global 
memory, it remains a good model in which to explore problems that are hard to parallelize, 
even with the flexibility offered by this model. 



Problems 

MATHEMATICAL PRELIMINARIES 

1.1 Show that the sum S(k) below has value $S(k) = 2^k - 1$: 

$$S(k) = 2^{k-1} + 2^{k-2} + \cdots + 2^1 + 2^0$$

SETS, LANGUAGES, INTEGERS, AND GRAPHS 

1.2 Let A = {red, green, blue}, B = {green, violet}, and C = {red, yellow, blue, green}. 
Determine the elements in $(A \cap C) \times (B - C)$. 

1.3 Let the relation $R \subseteq \mathbb{N} \times \mathbb{N}$ be defined by pairs (a, b) such that a and b have the same 
remainder on division by 3. Show that R is an equivalence relation. 

1.4 Let $R \subseteq A \times A$ be an equivalence relation. Let the set E[a] be the elements in A 
equivalent under the relation R to the element a. Show that for all $a, b \in A$ the 
equivalence classes E[a] and E[b] are either equal or disjoint. Also show that A is the 
union of all equivalence classes. 

1.5 In terms of the Kleene closure and the concatenation of sets, describe the languages 
containing the following: 

a) Strings over {0, 1} beginning with 01. 

b) Strings beginning with 0 that alternate between 0 and 1. 

1.6 Describe an algorithm to convert numbers from decimal to binary notation. 

1.7 A graph G = (V, E) can be described by adjacency lists, one list for each vertex in the 
graph. The adjacency list for vertex $v \in V$ is a list of vertices to which there is an edge 
from v. Generate adjacency lists for the two graphs of Fig. 1.2. 

TASKS AS FUNCTIONS 

1.8 Let $\mathbb{Z}_5$ be the set {0, 1, 2, 3, 4}. Let the addition operator $\oplus$ over this set be modulo 5; 
that is, if x and y are two such integers, $x \oplus y$ is obtained by adding x and y as integers 
and taking the remainder after division by 5. For example, $2 \oplus 2 = 4 \bmod 5$ whereas 
$3 \oplus 4 = 7 = 2 \bmod 5$. Provide a table describing the function $f_\oplus : \mathbb{Z}_5 \times \mathbb{Z}_5 \mapsto \mathbb{Z}_5$. 

1.9 Give a truth table for the Boolean function whose value is True exactly when either x 
or y or both is True and z is False. 

RATE OF GROWTH OF FUNCTIONS 

1.10 For each of the fifteen unordered pairs of functions $f$ and $g$ below, determine whether 
$f(n) = O(g(n))$, $f(n) = \Omega(g(n))$, or $f(n) = \Theta(g(n))$. 

a) $n^3$;    c) $n^6$;    e) $n^3 \log_2 n$; 

b) $2^{n \log_2 n}$;    d) $n 2^n$;    f) $2^{2^n}$. 

1.11 Show that $2.7n^2 + 6\sqrt{n}\lceil \log_2 n \rceil \leq 8.7n^2$ for $n \geq 3$. 

METHODS OF PROOF 

1.12 Let $S_r(n) = \sum_{j=1}^{n} j^r$ denote a sum of powers of integers. Use proof by induction to 
show that the following identities on arithmetic series hold: 

a) $S_2(n) = \frac{n^3}{3} + \frac{n^2}{2} + \frac{n}{6}$ 

b) $S_3(n) = \frac{n^4}{4} + \frac{n^3}{2} + \frac{n^2}{4}$ 

COMPUTATIONAL MODELS 

1.13 Produce a circuit and straight-line program for the Boolean function described in Prob- 
lem 1.9. 

1.14 A state diagram for a finite-state machine is a graph containing one vertex (or state) 
for each pattern of data that can be held in its memory and an edge from state p to 
state q if there is a value for the input data that causes the memory to change from p 
to q. Such an edge is labeled with the value of the input data that causes the transition. 
Outputs are generated by a finite-state machine when it is in a state. The vertices of its 
state diagram are labeled by these outputs. 

Provide a state diagram for the finite-state machine described in Fig. 1.5(b). 

1.15 Using the straight-line program given for the full-adder circuit in Section 1.4.1, describe 
how such a program would be placed in the random-access memory of the RAM and 
how the RAM would run the fetch-and-execute cycle to compute the values produced 
by the full-adder circuit. This is an example of circuit simulation by a program. 




1.16 Describe the actions that could be taken by a Turing machine to simulate a circuit from 
a straight-line program for it. Illustrate your approach by applying it to the simulation 
of the full-adder circuit described in Section 1.4.1. 

1.17 Suppose you are told that a function is computed in four time steps by a very simple 
finite-state machine, one whose logic circuit (but not its memory) can be realized with 
four logic gates. Suppose you are also told that the same function cannot be computed 
by a logic circuit with fewer than 20 logic gates. What can be said about these two 
statements? Explain your answer. 

1.18 Describe a finite-state machine that recognizes the language consisting of those strings 
over {0, 1} that end in 1. 

1.19 Determine the language generated by the context-free grammar $G = (\mathcal{N}, \mathcal{T}, \mathcal{R}, S)$ 
where $\mathcal{N} = \{S, M, N\}$, $\mathcal{T} = \{a, b, c, d\}$ and $\mathcal{R}$ consists of the rules given below. 

a) $S \to MN$    d) $N \to cNd$ 

b) $M \to aMb$    e) $N \to cd$ 

c) $M \to ab$ 

COMPUTATIONAL COMPLEXITY 

1.20 Using the rules for the red pebble game, show how to pebble the FFT graph of Fig. 1.10 
with five red pebbles by labeling each vertex with the time step on which it is pebbled. 
If a vertex has to be repebbled, it will be pebbled on two time steps. 

1.21 Suppose that you are told that the n-point FFT graph can be pebbled with $\sqrt{n}$ pebbles 
in n/4 time steps for n > 37. What can you say about this statement? 

1.22 You have been told that the FFT graph of Fig. 1.10 cannot be pebbled with fewer than 
three red pebbles. Show that it can be pebbled with two red pebbles in the red-blue 
pebble game by sketching how you would use blue pebbles to achieve this objective. 

PARALLEL COMPUTATION 

1.23 Using Fig. 1.12 as a guide, design a systolic array to convolve two sequences of length 
two. Sketch out each step of the convolution process. 

1.24 Consider a version of the PRAM consisting of a collection of RAMs (see Fig. 1.13) with 
small local random-access memories that repeat the following three-step cycle until they 
halt: a) they simultaneously read one word from a common global memory, b) they 
execute one local instruction using local memory, and c) they write one word to the 
common memory. When reading and writing, the individual processors are allowed 
to read and write from the same location. If two RAMs write to the same location, 
they must be programmed so that they write a common value. (This is known as the 
concurrent-read, concurrent-write (CRCW) PRAM.) Each RAM has a unique integer 
associated with it and can use this number to decide where to read or write in the 
common memory. 

Show that the CRCW PRAM can compute the AND of n Boolean variables in two 

cycles. 

Hint: Reserve one word in common memory and initialize it with and assign RAMs 

to the appropriate memory cells. 
Figure 1.13 The PRAM model is a collection of synchronous RAMs accessing a common 
memory. 



Chapter Notes 



Since this chapter introduces concepts used elsewhere in the book, we postpone the biblio- 
graphic citations to later chapters. We remark here, however, that the notation for the rate of 
growth of functions in Section 1.2.8 is due to Knuth [171]. The reader interested in more in- 
formation on the development of the digital computer, ranging from Babbage's seminal work 
in the 1830s to the pioneering work of the 1940s, should consult the collection of papers 
selected and edited by Brian Randell [268]. 



Part II 

GENERAL COMPUTATIONAL 

MODELS 



CHAPTER 



Logic Circuits 



Many important functions are naturally computed with straight-line programs, programs 
without loops or branches. Such computations are conveniently described with circuits, di- 
rected acyclic graphs of straight-line programs. Circuit vertices are associated with program 
steps, whereas edges identify dependencies between steps. Circuits are characterized by their 
size, the number of vertices, and their depth, the length (in edges) of their longest path. 
Circuits in which the operations are Boolean are called logic circuits, those using algebraic 
operations are called algebraic circuits, and those using comparison operators are called com- 
parator circuits. In this chapter we examine logic circuits. Algebraic and comparator circuits 
are examined in Chapter 6. 

Logic circuits are the basic building blocks of real-world computers. As shown in Chap- 
ter 3, all machines with bounded memory can be constructed of logic circuits and binary 
memory units. Furthermore, machines whose computations terminate can be completely sim- 
ulated by circuits. 

In this chapter circuits are designed for a large number of important functions. We begin 
with a discussion of circuits, straight-line programs, and the functions computed by them. 
Normal forms, a structured type of circuit, are examined next. They are a starting point for 
the design of circuits that compute functions. We then develop simple circuits that combine 
and select data. They include logical circuits, encoders, decoders, multiplexers, and demulti- 
plexers. This is followed by an introduction to prefix circuits that efficiently perform running 
sums. Circuits are then designed for the arithmetic operations of addition (in which prefix 
computations are used), subtraction, multiplication, and division. We also construct efficient 
circuits for symmetric functions. We close with proofs that every Boolean function can be 
realized with size and depth exponential and linear, respectively, in its number of inputs, and 
that most Boolean functions require such circuits. 

The concept of a reduction from one problem to a previously solved one is introduced in 
this chapter and applied to many simple functions. This important idea is used later to show 
that two problems, such as different NP-complete problems, have the same computational 
complexity. (See Chapters 3 and 8.) 
2.1 Designing Circuits 

The logic circuit, as defined in Section 1.4.1, is a directed acyclic graph (DAG) whose vertices 
are labeled with the names of Boolean functions (logic gates) or variables (inputs). Each logic 
circuit computes a binary function $f : \mathcal{B}^n \mapsto \mathcal{B}^m$ that is a mapping from the values of its n 
input variables to the values of its m outputs. 

Computer architects often need to design circuits for functions, a task that we explore in 
this chapter. The goal of the architect is to design efficient circuits, circuits whose size (the 
number of gates) and/or depth (the length of the longest path from an input to an output 
vertex) is small. The computer scientist is interested in circuit size and depth because these 
measures provide lower bounds on the resources needed to complete a task. (See Section 1.5.1 
and Chapter 3.) For example, circuit size provides a lower bound on the product of the 
space and time needed for a problem on both the random-access and Turing machines (see 
Sections 3.6 and 3.9.2) and circuit depth is a measure of the parallel time needed to compute 
a function (see Section 8.14.1). 

The logic circuit also provides a framework for the classification of problems by their com- 
putational complexity. For example, in Section 3.9.4 we use circuits to identify hard compu- 
tational problems, in particular, the P-complete languages that are believed hard to parallelize 
and the NP-complete languages that are believed hard to solve on serial computers. After more 
than fifty years of research it is still unknown whether NP-complete problems have polynomial- 
time algorithms. 

In this chapter not only do we describe circuits for important functions, but we show that 
most Boolean functions are complex. For example, we show that there are so many Boolean 
functions on n variables and so few circuits containing C or fewer gates that unless C is large, 
not all Boolean functions can be realized with C gates or fewer. 
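
The counting argument can be illustrated numerically. The sketch below is not the bound derived in this chapter; it is a deliberately crude overcount under stated assumptions: single-output functions, a basis of `basis_size` gate types with fan-in at most two, each gate described by its type and two sources chosen from the n inputs and C gates, and one of the n + C vertices designated as the output. All function names are hypothetical.

```python
def num_boolean_functions(n):
    """Number of functions from n Boolean inputs to one Boolean output."""
    return 2 ** (2 ** n)


def circuit_count_upper_bound(n, gates, basis_size=3):
    """Crude upper bound on the number of circuit descriptions with `gates` gates.

    Each gate picks a type and two sources among the n inputs and the
    gates themselves (acyclicity is ignored, which only overcounts);
    the output is one of the n + gates vertices. A smaller circuit can be
    padded with unused gates, so circuits with fewer gates are covered too.
    """
    sources = n + gates
    return (basis_size * sources ** 2) ** gates * sources


def gates_surely_needed(n, basis_size=3):
    """Largest C for which the bound proves some function needs more than C gates."""
    c = 0
    while circuit_count_upper_bound(n, c + 1, basis_size) < num_boolean_functions(n):
        c += 1
    return c
```

Because the number of functions grows doubly exponentially in n while this bound grows only like an exponential in C log C, the value returned by `gates_surely_needed` increases rapidly with n, which is the qualitative point of the argument.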

Circuit complexity is also explored in Chapter 9. The present chapter develops methods 
to derive lower bounds on the size and depth of circuits. A lower bound on the circuit size 
(depth) of a function $f$ is a value for the size (depth) below which there does not exist a circuit 
for $f$. Thus, every circuit for $f$ must have a size (depth) greater than or equal to the lower 
bound. In Chapter 9 we also establish a connection between circuit depth and formula size, 
the number of Boolean operations needed to realize a Boolean function by a formula. This 
allows us to derive an upper bound on formula size from an upper bound on depth. Thus, the 
depth bounds of this chapter are useful in deriving upper bounds on the size of the smallest 
formulas for problems. Prefix circuits are used in the present chapter to design fast adders. 
They are also used in Chapter 6 to design fast parallel algorithms. 



2.2 Straight-Line Programs and Circuits 



As suggested in Section 1.4.1, the mapping between inputs and outputs of a logic circuit can 
be described by a binary function. In this section we formalize this idea and, in addition, 
demonstrate that every binary function can be realized by a circuit. Normal-form expansions 
of Boolean functions play a central role in establishing the latter result. Circuits were defined 
informally in Section 1.4.1. We now give a formal definition of circuits. 

To fix ideas, we start with an example. Figure 2.1 shows a circuit that contains two AND
gates, one OR gate, and two NOT gates. (Circles denote NOT gates; AND and OR gates are
labeled $\wedge$ and $\vee$, respectively.) Corresponding to this circuit is the following functional
description of the circuit, where $g_j$ is the value computed by the jth input or gate of the circuit:

$$\begin{array}{ll}
g_1 := x; & g_5 := g_1 \wedge g_4; \\
g_2 := y; & g_6 := g_3 \wedge g_2; \\
g_3 := \overline{g}_1; & g_7 := g_5 \vee g_6; \\
g_4 := \overline{g}_2; &
\end{array} \tag{2.1}$$

Figure 2.1 A circuit is the graph of a Boolean straight-line program.



The statement $g_1 := x$; means that the external input x is the value associated with the first
vertex of the circuit. The statement $g_3 := \overline{g}_1$; means that the value computed at the third
vertex is the NOT of the value computed at the first vertex. The statement $g_5 := g_1 \wedge g_4$;
means that the value computed at the fifth vertex is the AND of the values computed at the
first and fourth vertices. The statement $g_7 := g_5 \vee g_6$; means that the value computed at the
seventh vertex is the OR of the values computed at the fifth and sixth vertices. The above is
a description of the functions computed by the circuit. It does not explicitly specify which 
function(s) are the outputs of the circuit. 

Shown below is an alternative description of the above circuit that contains the same infor- 
mation. It is a straight-line program whose syntax is closer to that of standard programming 
languages. Each step is numbered and its associated purpose is given. Input and output 
steps are identified by the keywords READ and OUTPUT, respectively. Computation steps 
are identified by the keywords AND, OR, and NOT. 



(1 READ x)
(2 READ y)
(3 NOT 1)
(4 NOT 2)
(5 AND 1 4)
(6 AND 3 2)
(7 OR 5 6)
(8 OUTPUT 5)
(9 OUTPUT 7)                                                        (2.2)



The correspondence between the steps of a straight-line program and the functions computed 
at them is evident. 

Straight-line programs are not limited to describing logic circuits. They can also be used to
describe algebraic computations. (See Chapter 6.) In this case, a computation step is identified
with a keyword describing the particular algebraic operation to be performed. In the case of
logic circuits, the operations can include many functions other than the basic three mentioned
above.

As illustrated above, a straight-line program can be constructed for any circuit. Similarly, 
given a straight-line program, a circuit can be drawn for it as well. We now formally define 
straight-line programs, circuits, and characteristics of the two. 

DEFINITION 2.2.1 A straight-line program is a set of steps each of which is an input step,
denoted (s READ x), an output step, denoted (s OUTPUT i), or a computation step, denoted
(s OP i ... k). Here s is the number of a step, x denotes an input variable, and the keywords
READ, OUTPUT, and OP identify steps in which an input is read, an output produced, and the
operation OP is performed. In the sth computation step the arguments to OP are the results produced
at steps $i, \ldots, k$. It is required that these steps precede the sth step; that is, $s > i, \ldots, k$.

A circuit is the graph of a straight-line program. (The requirement that each computation
step operate on results produced in preceding steps ensures that this graph is a DAG.) The fan-in
of the circuit is the maximum in-degree of any vertex. The fan-out of the circuit is the maximum
out-degree of any vertex. A gate is the vertex associated with a computation step.

The basis $\Omega$ of a circuit and its corresponding straight-line program is the set of operations
that they use. The bases of Boolean straight-line programs and logic circuits contain only Boolean
functions. The standard basis, $\Omega_0$, for a logic circuit is the set {AND, OR, NOT}.



2.2.1 Functions Computed by Circuits

As stated above, each step of a straight-line program computes a function. We now define the 
functions computed by straight-line programs, using the example given in Eq. (2.2). 

DEFINITION 2.2.2 Let $g_s$ be the function computed by the sth step of a straight-line program.
If the sth step is the input step (s READ x), then $g_s = x$. If it is the computation
step (s OP i ... k), the function is $g_s = \mathrm{OP}(g_i, \ldots, g_k)$, where $g_i, \ldots, g_k$ are the functions
computed at steps on which the sth step depends. If a straight-line program has n inputs and m
outputs, it computes a function $f : \mathcal{B}^n \mapsto \mathcal{B}^m$. If $s_1, s_2, \ldots, s_m$ are the output steps, then
$f = (g_{s_1}, g_{s_2}, \ldots, g_{s_m})$. The function computed by a circuit is the function computed by the
corresponding straight-line program.



The functions computed by the logic circuit of Fig. 2.1 are given below. The expression
for $g_s$ is found by substituting for its arguments the expressions derived at the steps on which
it depends.

$$\begin{array}{ll}
g_1 := x; & g_5 := x \wedge \overline{y}; \\
g_2 := y; & g_6 := \overline{x} \wedge y; \\
g_3 := \overline{x}; & g_7 := (x \wedge \overline{y}) \vee (\overline{x} \wedge y); \\
g_4 := \overline{y}; &
\end{array}$$

The function computed by the above Boolean straight-line program is $f(x, y) = (g_5, g_7)$.
The table of values assumed by f as the inputs x and y run through all possible values is shown
below. The value of $g_7$ is the EXCLUSIVE OR function.

    x  y | g_5  g_7
    0  0 |  0    0
    0  1 |  0    1
    1  0 |  1    1
    1  1 |  0    0

We now ask the following question: "Given a circuit with values for its inputs, how can the
values of its outputs be computed?" One response is to build a circuit of physical gates, supply
values for the inputs, and then wait for the signals to propagate through the gates and arrive at 
the outputs. A second response is to write a program in a high-level programming language to 
compute the values of the outputs. A simple program for this purpose assigns each step to an 
entry of an array and then evaluates the steps in order. This program solves the circuit value 
problem; that is, it determines the value of a circuit. 
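
The simple program described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the tuple encoding of steps and the names `PROGRAM`, `OPS`, and `evaluate` are assumptions made for this example; only the steps themselves are taken from program (2.2).

```python
# Each step is a tuple (step_number, keyword, *arguments), mirroring (2.2).
# READ takes a variable name; OUTPUT and the gate keywords take step numbers.
PROGRAM = [
    (1, "READ", "x"),
    (2, "READ", "y"),
    (3, "NOT", 1),
    (4, "NOT", 2),
    (5, "AND", 1, 4),
    (6, "AND", 3, 2),
    (7, "OR", 5, 6),
    (8, "OUTPUT", 5),
    (9, "OUTPUT", 7),
]

# The standard basis {AND, OR, NOT} on the values 0 and 1.
OPS = {
    "NOT": lambda a: 1 - a,
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
}


def evaluate(program, inputs):
    """Evaluate the steps in order and return the tuple of output values."""
    values = {}   # step number -> value computed at that step
    outputs = []
    for step, keyword, *args in program:
        if keyword == "READ":
            values[step] = inputs[args[0]]
        elif keyword == "OUTPUT":
            outputs.append(values[args[0]])
        else:
            # Arguments refer to earlier steps, so their values are already known.
            values[step] = OPS[keyword](*(values[i] for i in args))
    return tuple(outputs)


if __name__ == "__main__":
    for x in (0, 1):
        for y in (0, 1):
            print(x, y, evaluate(PROGRAM, {"x": x, "y": y}))
```

Because every step depends only on preceding steps, a single pass over the steps suffices, and the running time is proportional to the number of steps.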

2.2.2 Circuits That Compute Functions 

Now that we know how to compute the function defined by a circuit and its corresponding 
straight-line program, we ask: given a function, how can we construct a circuit (and straight- 
line program) that will compute it? Since we presume that computational tasks are defined by 
functions, it is important to know how to build simple machines, circuits, that will solve these 
tasks. In Chapter 3 we show that circuits play a central role in the design of machines with 
memory. Thus, whether a function or task is to be solved with a machine without memory (a 
circuit) or a machine with memory (such as the random-access machine), the circuit and its 
associated straight-line program play a key role. 

To construct a circuit for a function, we begin by describing the function in a table. As
seen earlier, the table for a function $f^{(n,m)} : \mathcal{B}^n \mapsto \mathcal{B}^m$ has n columns containing all $2^n$
possible values for the n input variables of the function. Thus, it has $2^n$ rows. It also has
m columns containing the m outputs associated with each pattern of n inputs. If we let
$x_1, x_2, \ldots, x_n$ be the input variables of f and let $y_1, y_2, \ldots, y_m$ be its output variables,
then we write $f(x_1, x_2, \ldots, x_n) = (y_1, y_2, \ldots, y_m)$. This is illustrated by the function
$f^{(3,2)}_{\mathrm{example}}(x_1, x_2, x_3) = (y_1, y_2)$ defined in Fig. 2.2.

Figure 2.2 The truth table for the function $f^{(3,2)}_{\mathrm{example}}$.

A binary function is one whose domain and codomain are Cartesian products of B = 
{0, 1}. A Boolean function is a binary function whose codomain consists of the set B. In 
other words, it has one output. 

As we see in Section 2.3, normal forms provide standard ways to construct circuits for 
Boolean functions. Because a normal-form expansion of a function generally does not yield 
a circuit of smallest size or depth, methods are needed to simplify the algebraic expressions 
produced by these normal forms. This topic is discussed in Section 2.2.4. 

Before exploring the algebraic properties of simple Boolean functions, we define the basic 
circuit complexity measures used in this book. 



2.2.3 Circuit Complexity Measures 

We often ask for the smallest or most shallow circuit for a function. If we need to compute 
a function with a circuit, as is done in central processing units, then knowing the size of the 
smallest circuit is important. Also important is the depth of the circuit. It takes time for 
signals applied to the circuit inputs to propagate to the outputs, and the length of the longest 
path through the circuit determines this time. When central processing units must be fast, 
minimizing circuit depth becomes important. 

As indicated in Section 1.5, the size of a circuit also provides a lower bound on the space- 
time product needed to solve a problem on the random-access machine, a model for modern 
computers. Consequently, if the size of the smallest circuit for a function is large, its space-time 
product must be large. Thus, a problem can be shown to be hard to compute by a machine 
with memory if it can be shown that every circuit for it is large. 

We now define two important circuit complexity measures. 

DEFINITION 2.2.3 The size of a logic circuit is the number of gates it contains. Its depth is the
number of gates on the longest path through the circuit. The circuit size, $C_\Omega(f)$, and circuit
depth, $D_\Omega(f)$, of a binary function $f : \mathcal{B}^n \mapsto \mathcal{B}^m$ are defined as the smallest size and smallest
depth of any circuit, respectively, over the basis $\Omega$ for f.

Most Boolean functions on n variables are very complex. As shown in Sections 2.12 and
2.13, their circuit size is proportional to $2^n/n$ and their depth is approximately n. Fortunately,
most functions of interest have much smaller size and depth. (It should be noted that the circuit 
of smallest size for a function may be different from that of smallest depth.) 



2.2.4 Algebraic Properties of Boolean Functions

Since the operations AND ($\wedge$), OR ($\vee$), EXCLUSIVE OR ($\oplus$), and NOT ($\neg$ or an overbar) play a vital
role in the construction of normal forms, we simplify the subsequent discussion by describing 
their properties. 

If we interchange the two arguments of AND, OR, or EXCLUSIVE OR, it follows from their
definition that their values do not change. This property, called commutativity, holds for all
three operators, as stated next.

COMMUTATIVITY

$$\begin{array}{l}
x_1 \vee x_2 = x_2 \vee x_1 \\
x_1 \wedge x_2 = x_2 \wedge x_1 \\
x_1 \oplus x_2 = x_2 \oplus x_1
\end{array}$$

When constants are substituted for one of the variables of these three operators, the expression 
computed is simplified, as shown below. 

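SUBSTITUTION OF CONSTANTS

$$\begin{array}{ll}
x_1 \vee 0 = x_1 & x_1 \wedge 1 = x_1 \\
x_1 \vee 1 = 1 & x_1 \oplus 0 = x_1 \\
x_1 \wedge 0 = 0 & x_1 \oplus 1 = \overline{x}_1
\end{array}$$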

Also, when one of the variables of one of these functions is replaced by itself or its negation, 
the functions simplify, as shown below. 

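ABSORPTION RULES

$$\begin{array}{ll}
x_1 \vee x_1 = x_1 & x_1 \wedge x_1 = x_1 \\
x_1 \vee \overline{x}_1 = 1 & x_1 \wedge \overline{x}_1 = 0 \\
x_1 \oplus x_1 = 0 & x_1 \vee (x_1 \wedge x_2) = x_1 \\
x_1 \oplus \overline{x}_1 = 1 & x_1 \wedge (x_1 \vee x_2) = x_1
\end{array}$$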

To prove each of these results, it suffices to test exhaustively each of the values of the arguments 
of these functions and show that the right- and left-hand sides have the same value. 
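
The exhaustive test just described is easy to mechanize. The sketch below checks the eight absorption rules over all values of $x_1$ and $x_2$; the helper names `NOT`, `same_function`, and `ABSORPTION_RULES` are introduced here only for illustration.

```python
from itertools import product


def NOT(a):
    """Complement of a value in {0, 1}."""
    return 1 - a


def same_function(lhs, rhs, n):
    """True if lhs and rhs agree on all 2**n points of {0, 1}**n."""
    return all(lhs(*v) == rhs(*v) for v in product((0, 1), repeat=n))


# Each pair is (left-hand side, right-hand side) of one absorption rule.
# Python's |, &, ^ on 0/1 play the roles of OR, AND, and EXCLUSIVE OR.
ABSORPTION_RULES = [
    (lambda x1, x2: x1 | x1,        lambda x1, x2: x1),
    (lambda x1, x2: x1 | NOT(x1),   lambda x1, x2: 1),
    (lambda x1, x2: x1 ^ x1,        lambda x1, x2: 0),
    (lambda x1, x2: x1 ^ NOT(x1),   lambda x1, x2: 1),
    (lambda x1, x2: x1 & x1,        lambda x1, x2: x1),
    (lambda x1, x2: x1 & NOT(x1),   lambda x1, x2: 0),
    (lambda x1, x2: x1 | (x1 & x2), lambda x1, x2: x1),
    (lambda x1, x2: x1 & (x1 | x2), lambda x1, x2: x1),
]

results = [same_function(lhs, rhs, 2) for lhs, rhs in ABSORPTION_RULES]

if __name__ == "__main__":
    print(results)
```

The same helper applies to any proposed identity on n variables, at a cost of $2^n$ evaluations of each side.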

DeMorgan's rules, shown below, are very important in proving properties about circuits 
because they allow each AND gate to be replaced by an OR gate and three NOT gates and vice 
versa. The rules can be shown correct by constructing tables for each of the given functions. 

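DEMORGAN'S RULES

$$\begin{array}{l}
\overline{(x_1 \vee x_2)} = \overline{x}_1 \wedge \overline{x}_2 \\
\overline{(x_1 \wedge x_2)} = \overline{x}_1 \vee \overline{x}_2
\end{array}$$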

The functions AND, OR, and EXCLUSIVE OR are all associative; that is, all ways of combining
three or more variables with any of these functions give the same result. (An operator $\odot$ is
associative if for all values of a, b, and c, $a \odot (b \odot c) = (a \odot b) \odot c$.) Again, proof by
enumeration suffices to establish the following results.

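ASSOCIATIVITY

$$\begin{array}{l}
x_1 \vee (x_2 \vee x_3) = (x_1 \vee x_2) \vee x_3 \\
x_1 \wedge (x_2 \wedge x_3) = (x_1 \wedge x_2) \wedge x_3 \\
x_1 \oplus (x_2 \oplus x_3) = (x_1 \oplus x_2) \oplus x_3
\end{array}$$

Because of associativity it is not necessary to parenthesize repeated uses of the operators $\vee$, $\wedge$,
and $\oplus$.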

Finally, the following distributive laws are important in simplifying Boolean algebraic
expressions. The first two laws are the same as the distributivity of integer multiplication over
integer addition when multiplication and addition are replaced by AND and OR.

DISTRIBUTIVITY

$$\begin{array}{l}
x_1 \wedge (x_2 \vee x_3) = (x_1 \wedge x_2) \vee (x_1 \wedge x_3) \\
x_1 \wedge (x_2 \oplus x_3) = (x_1 \wedge x_2) \oplus (x_1 \wedge x_3) \\
x_1 \vee (x_2 \wedge x_3) = (x_1 \vee x_2) \wedge (x_1 \vee x_3)
\end{array}$$

We often write $x \wedge y$ as xy. The operator $\wedge$ has precedence over the operators $\vee$ and $\oplus$, which
means that parentheses in $(x \wedge y) \vee z$ and $(x \wedge y) \oplus z$ may be dropped.
The above rules are illustrated by the following formula: 



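$$\begin{array}{rcl}
\overline{(x \wedge (y \oplus z))} \wedge (\overline{x} \vee y) &=& (\overline{x} \vee \overline{(y \oplus z)}) \wedge (\overline{x} \vee y) \\
&=& (\overline{x} \vee (\overline{y} \oplus z)) \wedge (\overline{x} \vee y) \\
&=& \overline{x} \vee (y \wedge (\overline{y} \oplus z)) \\
&=& \overline{x} \vee ((y \wedge \overline{y}) \oplus (y \wedge z)) \\
&=& \overline{x} \vee (0 \oplus y \wedge z) \\
&=& \overline{x} \vee (y \wedge z)
\end{array}$$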

DeMorgan's second rule is used to simplify the first term in the first equation. The last
rule on substitution of constants is used twice to simplify the second equation. The third
distributivity rule and commutativity of $\wedge$ are used to simplify the third one. The second
distributivity rule is used to expand the fourth equation. The fifth equation is simplified by
invoking the absorption rule $x_1 \wedge \overline{x}_1 = 0$. The final equation results from the commutativity of $\oplus$
and application of the rule $x_1 \oplus 0 = x_1$. When there is no loss of clarity, we drop the operator
symbol $\wedge$ between two literals.

2.3 Normal-Form Expansions of Boolean Functions 

Normal forms are standard ways of constructing circuits from the tables defining Boolean 
functions. They are easy to apply, although the circuits they produce are generally far from 
optimal. They demonstrate that every Boolean function can be realized over the standard basis 
as well as the basis containing AND and EXCLUSIVE OR. 

In this section we define five normal forms: the disjunctive and conjunctive normal forms, 
the sum-of-products expansion, the product-of-sums expansion, and the ring-sum expansion. 

2.3.1 Disjunctive Normal Form 

A minterm in the variables $x_1, x_2, \ldots, x_n$ is the AND of each variable or its negation. For
example, when n = 3, $\overline{x}_1 \wedge \overline{x}_2 \wedge \overline{x}_3$ is a minterm. It has value 1 exactly when each variable
has value 0. $x_1 \wedge \overline{x}_2 \wedge x_3$ is another minterm; it has value 1 exactly when $x_1 = 1$, $x_2 = 0$ and
$x_3 = 1$. It follows that a minterm on n variables has value 1 for exactly one of the $2^n$ points
in its domain. Using the notation $x^1 = x$ and $x^0 = \overline{x}$, we see that the above minterms can
be written as $x_1^0 x_2^0 x_3^0$ and $x_1^1 x_2^0 x_3^1$, respectively, when we drop the use of the AND operator $\wedge$.
Thus, $x_1^0 x_2^0 x_3^0 = 1$ when $x = (x_1, x_2, x_3) = (0, 0, 0)$ and $x_1^1 x_2^0 x_3^1 = 1$ when $x = (1, 0, 1)$.
That is, the minterm $x^{(c)} = x_1^{c_1} \wedge x_2^{c_2} \wedge \cdots \wedge x_n^{c_n}$ has value 1 exactly when $x = c$ where
$c = (c_1, c_2, \ldots, c_n)$. A minterm of a Boolean function f is a minterm $x^{(c)}$ that contains all the
variables of f and for which $f(c) = 1$.

The word "disjunction" is a synonym for OR, and the disjunctive normal form (DNF) of
a Boolean function $f : \mathcal{B}^n \mapsto \mathcal{B}$ is the OR of the minterms of f. Thus, f has value 1 when
exactly one of its minterms has value 1 and has value 0 otherwise. Consider the function whose
table is given in Fig. 2.3(a). Its disjunctive normal form (or minterm expansion) is given by
the following formula:

$$f(x_1, x_2, x_3) = x_1^0 x_2^0 x_3^0 \vee x_1^0 x_2^1 x_3^0 \vee x_1^1 x_2^0 x_3^0 \vee x_1^1 x_2^0 x_3^1 \vee x_1^1 x_2^1 x_3^1$$

Figure 2.3 Truth tables illustrating the disjunctive and conjunctive normal forms.

The parity function $f^{(n)}_\oplus : \mathcal{B}^n \mapsto \mathcal{B}$ on n inputs has value 1 when an odd number of
inputs is 1 and value 0 otherwise. It can be realized by a circuit containing $n - 1$ instances of
the EXCLUSIVE OR operator; that is, $f^{(n)}_\oplus(x_1, \ldots, x_n) = x_1 \oplus x_2 \oplus \cdots \oplus x_n$. However, the
DNF of $f^{(n)}_\oplus$ contains $2^{n-1}$ minterms, a number exponential in n. The DNF of $f^{(3)}_\oplus$ is

$$f^{(3)}_\oplus(x, y, z) = \overline{x}\,\overline{y}\,z \vee \overline{x}\,y\,\overline{z} \vee x\,\overline{y}\,\overline{z} \vee x\,y\,z$$

Here we use the standard notation for a variable and its complement.
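
The construction of a DNF from a table can be sketched in a few lines. The code below is illustrative only: the function names `minterms`, `dnf_string`, and `parity`, and the use of `~`, `&`, and `|` as printed symbols for NOT, AND, and OR, are assumptions of this example.

```python
from itertools import product


def minterms(f, n):
    """Return the points c of {0,1}**n at which f(c) = 1, in table order."""
    return [c for c in product((0, 1), repeat=n) if f(*c) == 1]


def dnf_string(f, names):
    """Write the DNF of f as a string: one AND term per minterm, joined by OR."""
    terms = []
    for c in minterms(f, len(names)):
        # A variable appears unnegated where c has a 1 and negated where c has a 0.
        literals = [v if bit else "~" + v for v, bit in zip(names, c)]
        terms.append("(" + " & ".join(literals) + ")")
    return " | ".join(terms)


def parity(*bits):
    """Value 1 when an odd number of inputs is 1, and 0 otherwise."""
    return sum(bits) % 2


if __name__ == "__main__":
    print(dnf_string(parity, ["x", "y", "z"]))
    for n in range(1, 8):
        print(n, len(minterms(parity, n)))
```

For n = 3 this lists the four minterms of the parity function, and the counts for larger n double with each added variable.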

2.3.2 Conjunctive Normal Form 

A maxterm in the variables $x_1, x_2, \ldots, x_n$ is the OR of each variable or its negation. For
example, $x_1 \vee x_2 \vee \overline{x}_3$ is a maxterm. It has value 0 exactly when $x_1 = x_2 = 0$ and $x_3 = 1$.
$x_1 \vee \overline{x}_2 \vee \overline{x}_3$ is another maxterm; it has value 0 exactly when $x_1 = 0$ and $x_2 = x_3 = 1$.
It follows that a maxterm on n variables has value 0 for exactly one of the $2^n$ points in its
domain. We see that the above maxterms can be written as $x_1^1 \vee x_2^1 \vee x_3^0$ and $x_1^1 \vee x_2^0 \vee x_3^0$,
respectively. Thus, $x_1^1 \vee x_2^1 \vee x_3^0 = 0$ when $x = (x_1, x_2, x_3) = (0, 0, 1)$ and $x_1^1 \vee x_2^0 \vee x_3^0 = 0$
when $x = (0, 1, 1)$. That is, the maxterm $x^{(c)} = x_1^{\overline{c}_1} \vee x_2^{\overline{c}_2} \vee \cdots \vee x_n^{\overline{c}_n}$ has value 0 exactly
when $x = c$. A maxterm of a Boolean function f is a maxterm $x^{(c)}$ that contains all the
variables of f and for which $f(c) = 0$.

The word "conjunction" is a synonym for AND, and the conjunctive normal form (CNF)
of a Boolean function $f : \mathcal{B}^n \mapsto \mathcal{B}$ is the AND of the maxterms of f. Thus, f has value 0
when exactly one of its maxterms has value 0 and has value 1 otherwise. Consider the function
whose table is given in Fig. 2.3(b). Its conjunctive normal form (or maxterm expansion) is
given by the following formula:

$$f(x_1, x_2, x_3) = (x_1^1 \vee x_2^1 \vee x_3^0) \wedge (x_1^1 \vee x_2^0 \vee x_3^0) \wedge (x_1^0 \vee x_2^0 \vee x_3^1)$$

An important relationship holds between the DNF and CNF representations for Boolean
functions. If DNF(f) and CNF(f) are the representations of f in the DNF and CNF expansions,
then the following identity holds (see Problem 2.6):

$$\mathrm{CNF}(f) = \overline{\mathrm{DNF}(\overline{f})}$$

It follows that the CNF of the parity function $f^{(n)}_\oplus$ has $2^{n-1}$ maxterms.

Since each function $f : \mathcal{B}^n \mapsto \mathcal{B}^m$ can be expanded to its CNF or DNF and each can be
realized with circuits, the following result is immediate.

THEOREM 2.3.1 Every function $f : \mathcal{B}^n \mapsto \mathcal{B}^m$ can be realized by a logic circuit.

2.3.3 SOPE and POSE Normal Forms 

The sum-of-products and product-of-sums normal forms are simplifications of the disjunctive 
and conjunctive normal forms, respectively. These simplifications are obtained by using the 
rules stated in Section 2.2.4. 

A product in the variables $x_{i_1}, x_{i_2}, \ldots, x_{i_k}$ is the AND of each of these variables or their
negations. For example, $\overline{x}_2\,x_5\,x_6$ is a product. A minterm is a product that contains each of
the variables of a function. A product of a Boolean function f is a product in some of the
variables of f. A sum-of-products expansion (SOPE) of a Boolean function is the OR (the
sum) of products of f. Thus, the DNF is a special case of the SOPE of a function.

A SOPE of a Boolean function can be obtained by simplifying the DNF of a function 
using the rules given in Section 2.2.4. For example, the DNF given earlier and shown below 
can be simplified to produce a SOPE. 

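$$y_1(x_1, x_2, x_3) = \overline{x}_1\overline{x}_2\overline{x}_3 \vee \overline{x}_1 x_2 \overline{x}_3 \vee x_1\overline{x}_2\overline{x}_3 \vee x_1\overline{x}_2 x_3 \vee x_1 x_2 x_3$$

It is easy to see that the first and second terms combine to give $\overline{x}_1\overline{x}_3$, the first and third give
$\overline{x}_2\overline{x}_3$ (we use the property that $g \vee g = g$), and the last two give $x_1 x_3$. That is, we can write
the following SOPE for f:

$$f = \overline{x}_1\overline{x}_3 \vee x_1 x_3 \vee \overline{x}_2\overline{x}_3 \tag{2.3}$$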

Clearly, we could have stopped before any one of the above simplifications was used and
generated another SOPE for f. This illustrates the point that a Boolean function may have many
SOPEs but only one DNF.

A sum in the variables $x_{i_1}, x_{i_2}, \ldots, x_{i_k}$ is the OR of each of these variables or their
negations. For example, $\overline{x}_3 \vee x_4 \vee x_7$ is a sum. A maxterm is a sum that contains each of the
variables of a function. A sum of a Boolean function f is a sum in some of the variables of
f. A product-of-sums expansion (POSE) of a Boolean function is the AND (the product) of
sums of f. Thus, the CNF is a special case of the POSE of a function.

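A POSE of a Boolean function can be obtained by simplifying the CNF of a function
using the rules given in Section 2.2.4. For example, the conjunction of the two maxterms
$x_1 \vee \overline{x}_2 \vee x_3$ and $x_1 \vee \overline{x}_2 \vee \overline{x}_3$, namely $(x_1 \vee \overline{x}_2 \vee x_3) \wedge (x_1 \vee \overline{x}_2 \vee \overline{x}_3)$, can be reduced to
$x_1 \vee \overline{x}_2$ by the application of rules of Section 2.2.4, as shown below:

$$\begin{array}{rcll}
(x_1 \vee \overline{x}_2 \vee x_3) \wedge (x_1 \vee \overline{x}_2 \vee \overline{x}_3) &=& x_1 \vee ((\overline{x}_2 \vee x_3) \wedge (\overline{x}_2 \vee \overline{x}_3)) & \{\text{3rd distributivity rule}\} \\
&=& x_1 \vee \overline{x}_2 \vee (x_3 \wedge \overline{x}_3) & \{\text{3rd distributivity rule}\} \\
&=& x_1 \vee \overline{x}_2 \vee 0 & \{\text{6th absorption rule}\} \\
&=& x_1 \vee \overline{x}_2 & \{\text{1st rule on substitution of constants}\}
\end{array}$$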

It is easily shown that the POSE of the parity function is its CNF. (See Problem 2.8.) 

2.3.4 Ring-Sum Expansion 

The ring-sum expansion (RSE) of a function f is the EXCLUSIVE OR ($\oplus$) of a constant
and products ($\wedge$) of unnegated variables of f. For example, $1 \oplus x_1 x_3 \oplus x_2 x_4$ is an RSE.
The operations $\oplus$ and $\wedge$ over the set $\mathcal{B} = \{0, 1\}$ constitute a ring. (Rings are examined in
Section 6.2.1.) Any two instances of the same product in the RSE can be eliminated since they
sum to 0.

The RSE of a Boolean function $f : \mathcal{B}^n \mapsto \mathcal{B}$ can be constructed from its DNF, as we
show. Since a minterm of f has value 1 on exactly one of the $2^n$ points in its domain, at
most one minterm in the DNF for f has value 1 for any point in its domain. Thus, we
can combine minterms with EXCLUSIVE OR instead of OR without changing the value of the
function. Now replace $\overline{x}_i$ with $x_i \oplus 1$ in each minterm containing $\overline{x}_i$ and then apply the
second distributivity rule. We simplify the resulting formula by using commutativity and the
absorption rule $x_i \oplus x_i = 0$. For example, since the minterms of $(\overline{x}_1 \vee x_2)x_3$ are $\overline{x}_1 x_2 x_3$,
$\overline{x}_1 \overline{x}_2 x_3$, and $x_1 x_2 x_3$, we construct the RSE of this function as follows:

$$\begin{array}{rcl}
(\overline{x}_1 \vee x_2)x_3 &=& \overline{x}_1 x_2 x_3 \oplus \overline{x}_1 \overline{x}_2 x_3 \oplus x_1 x_2 x_3 \\
&=& (x_1 \oplus 1)x_2 x_3 \oplus (x_1 \oplus 1)(x_2 \oplus 1)x_3 \oplus x_1 x_2 x_3 \\
&=& x_2 x_3 \oplus x_1 x_2 x_3 \oplus x_3 \oplus x_1 x_3 \oplus x_2 x_3 \oplus x_1 x_2 x_3 \oplus x_1 x_2 x_3 \\
&=& x_3 \oplus x_1 x_3 \oplus x_1 x_2 x_3
\end{array}$$

The third equation follows by applying the second distributivity rule and commutativity. The
fourth follows by applying $x_i \oplus x_i = 0$ and commutativity. The two occurrences of $x_2 x_3$ are
canceled, as are two of the three instances of $x_1 x_2 x_3$.

As this example illustrates, the RSE of a function $f : \mathcal{B}^n \mapsto \mathcal{B}$ is the EXCLUSIVE OR of
a Boolean constant $c_0$ and one or more products of unnegated variables of f. Since each of
the n variables of f can be present or absent from a product, there are $2^n$ products, including
the product that contains no variables; that is, a constant whose value is 0 or 1. For example,
$1 \oplus x_3 \oplus x_1 x_3 \oplus x_1 x_2 x_3$ is the RSE of the function $\overline{(\overline{x}_1 \vee x_2)\,x_3}$.
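
The products appearing in an RSE can also be computed directly from a table. The sketch below does not follow the DNF substitution used in the text; instead it uses a standard alternative, the subset EXCLUSIVE OR transform (the coefficient of a product is the EXCLUSIVE OR of the function values at all points whose 1's lie among the variables of that product). The name `rse_products` and the encoding of a product as a tuple of variable indices are assumptions of this example.

```python
def rse_products(f, n):
    """Return the products in the RSE of f as sorted tuples of variable indices.

    The empty tuple () stands for the constant 1.
    """
    # Truth table indexed by bitmask: bit i of the mask is the value of x_{i+1}.
    coeff = [f(*[(mask >> i) & 1 for i in range(n)]) for mask in range(1 << n)]
    # In-place subset transform over GF(2): afterwards coeff[S] is the
    # EXCLUSIVE OR of f over all subsets of S.
    for i in range(n):
        for mask in range(1 << n):
            if mask & (1 << i):
                coeff[mask] ^= coeff[mask ^ (1 << i)]
    return sorted(
        tuple(i + 1 for i in range(n) if (mask >> i) & 1)
        for mask in range(1 << n)
        if coeff[mask]
    )


def example(x1, x2, x3):
    """The function (NOT x1 OR x2) AND x3 from the worked example."""
    return ((1 - x1) | x2) & x3


if __name__ == "__main__":
    print(rse_products(example, 3))
    print(rse_products(lambda *x: 1 - example(*x), 3))
```

For the worked example this yields the products $x_1 x_2 x_3$, $x_1 x_3$, and $x_3$, and complementing the function adds the constant 1.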

2.3.5 Comparison of Normal Forms 

It is easy to show that the RSE of a Boolean function is unique (see Problem 2.7). However, the
RSE is not necessarily a compact representation of a function. For example, the RSE of the OR
of n variables, $f^{(n)}_\vee$, includes every product term except for the constant 1. (See Problem 2.9.)
It is also true that some functions have large size in some normal forms but small size in
others. For example, the parity function has exponential size in the DNF and CNF normal
forms but linear size in the RSE. Also, $f^{(n)}_\vee$ has exponential size in the RSE but linear size in
the CNF and SOPE representations.

A natural question to ask is whether there is a function that has large size in all five normal
forms. The answer is yes. This is true of the Boolean function on n variables whose value is 1
when the sum of its variables is 0 modulo 3 and is 0 otherwise. It has exponential-size DNF,
CNF, and RSE normal forms. (See Problem 2.10.) However, its smallest circuit is linear in n.
(See Section 2.11.)



2.4 Reductions Between Functions 

A common way to solve a new problem is to apply an existing solution to it. For example, an 
integer multiplication algorithm can be used to square an integer by supplying two copies of 
the integer to the multiplier. This idea is called a "reduction" in complexity theory because we 
reduce one problem to a previously solved problem, here squaring to integer multiplication. In 
this section we briefly discuss several simple forms of reduction, including subfunctions. Note 
that the definitions given below are not limited to binary functions. 



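DEFINITION 2.4.1 A function $f : \mathcal{A}^n \mapsto \mathcal{A}^m$ is a reduction to the function $g : \mathcal{A}^r \mapsto \mathcal{A}^s$
through application of the functions $p : \mathcal{A}^s \mapsto \mathcal{A}^m$ and $q : \mathcal{A}^n \mapsto \mathcal{A}^r$ if for all $x \in \mathcal{A}^n$:

$$f(x) = p(g(q(x)))$$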

As suggested in Fig. 2.4, it follows that circuits for q, g and p can be cascaded (the output
of one is the input to the next) to form a circuit for f. Thus, the circuit size and depth of f,
C(f) and D(f), satisfy the following inequalities:

$$\begin{array}{l}
C(f) \le C(p) + C(g) + C(q) \\
D(f) \le D(p) + D(g) + D(q)
\end{array}$$



A special case of a reduction is the subfunction, as defined below. 

DEFINITION 2.4.2 Let $g : \mathcal{A}^n \mapsto \mathcal{A}^m$. A subfunction f of g is a function obtained by assigning
values to some of the input variables of g, assigning (not necessarily unique) variable names to the
rest, and deleting and/or permuting some of its output variables. We say that f is a reduction to g via
the subfunction relationship.

Figure 2.4 The function f is reduced to the function g by applying functions p and q to prepare
the input to g and manipulate its output.

Figure 2.5 The subfunction f of the function g is obtained by fixing some input variables,
assigning names to the rest, and deleting and/or permuting outputs.



This definition is illustrated by the function $f^{(3,2)}_{\mathrm{example}}(x_1, x_2, x_3) = (y_1, y_2)$ in Fig. 2.2.
We form the subfunction $y_1$ by deleting $y_2$ from $f^{(3,2)}_{\mathrm{example}}$ and fixing $x_1 = a$, $x_2 = 1$, and
$x_3 = b$, where a and b are new variables. Then, consulting (2.3), we see that $y_1$ can be written
as follows:

$$\begin{array}{rcl}
y_1 &=& (\overline{a}\,\overline{b}) \vee (a\,b) \vee (\overline{1}\,\overline{b}) \\
&=& \overline{a}\,\overline{b} \vee a\,b \\
&=& a \oplus b \oplus 1
\end{array}$$

That is, $y_1$ contains the complement of the EXCLUSIVE OR function as a subfunction. The
definition is also illustrated by the reductions developed in Sections 2.5.2, 2.5.6, 2.9.5, and 
2.10.1. 
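
The subfunction computed above can be checked mechanically. In the sketch below, `y1` encodes the SOPE of (2.3), read here as $\overline{x}_1\overline{x}_3 \vee x_1 x_3 \vee \overline{x}_2\overline{x}_3$; that reading, and the names `y1` and `subfunction`, are assumptions of this example.

```python
from itertools import product


def y1(x1, x2, x3):
    """The SOPE of (2.3), as read above, on values in {0, 1}."""
    neg = lambda v: 1 - v
    return (neg(x1) & neg(x3)) | (x1 & x3) | (neg(x2) & neg(x3))


def subfunction(a, b):
    """Fix x2 = 1 and rename x1, x3 as the new variables a, b."""
    return y1(a, 1, b)


# The subfunction agrees with the complement of EXCLUSIVE OR on all four inputs.
agrees = all(subfunction(a, b) == (a ^ b ^ 1) for a, b in product((0, 1), repeat=2))

if __name__ == "__main__":
    print(agrees)
```

No gates are added in passing from `y1` to `subfunction`: an input is merely replaced by a constant.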

The subfunction definition derives its importance from the following lemma. (See Fig. 2.5.) 

LEMMA 2.4.1 If f is a subfunction of g, a straight-line program for f can be created from one
for g without increasing the size or depth of its circuit.

As shown in Section 2.9.5, the logical shifting function (Section 2.5.2) can be realized
by composing the integer multiplication and decoder functions (Section 2.5). This type of 
reduction is useful in those cases in which one function is reduced to another with the aid of 
functions whose complexity (size or depth or both) is known to be small relative to that of 
either function. It follows that the two functions have the same asymptotic complexity even if 
we cannot determine what that complexity is. The reduction is a powerful idea that is widely 
used in computer science. Not only is it the essence of the subroutine, but it is also used to 
classify problems by their time or space complexity. (See Sections 3.9.3 and 8.7.) 



2.5 Specialized Circuits 



A small number of special functions arise repeatedly in the design of computers. These include
logical and shifting operations, encoders, decoders, multiplexers, and demultiplexers. In the
following sections we construct efficient circuits for these functions.

Figure 2.6 A balanced binary tree circuit that combines elements with an associative operator.



2.5.1 Logical Operations 



Logical operations are not only building blocks for more complex operations, but they are
at the heart of all central processing units. Logical operations include "vector" and "associating"
operations. A vector operation is the component-wise operation on one or more
vectors. For example, the vector NOT on the vector $x = (x_{n-1}, \ldots, x_1, x_0)$ is the vector
$\overline{x} = (\overline{x}_{n-1}, \ldots, \overline{x}_1, \overline{x}_0)$. Other vector operations involve the application of a two-input
function to corresponding components of two vectors. If $\star$ is a two-input function, such as AND
or OR, and $x = (x_{n-1}, \ldots, x_1, x_0)$ and $y = (y_{n-1}, \ldots, y_1, y_0)$ are two n-tuples, the vector
operation $x \star y$ is

$$x \star y = (x_{n-1} \star y_{n-1}, \ldots, x_1 \star y_1, x_0 \star y_0)$$

An associative operator $\odot$ over a set $\mathcal{A}$ satisfies the condition $(a \odot b) \odot c = a \odot (b \odot c)$ for all
$a, b, c \in \mathcal{A}$. A summing operation on an n-tuple x with an associative two-input operation $\odot$
produces the "sum" y defined below.

$$y = x_{n-1} \odot \cdots \odot x_1 \odot x_0$$

An efficient circuit for computing y is shown in Fig. 2.6. It is a binary tree whose leaves are
associated with the variables $x_{n-1}, \ldots, x_1, x_0$. Each level of the tree is full except possibly
the last. This circuit has smallest depth of those that form the associative combination of the
variables, namely $\lceil \log_2 n \rceil$.
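
The tree computation can be sketched as a level-by-level pairing of neighbors. This is one way to obtain depth $\lceil \log_2 n \rceil$; the tree it produces need not have exactly the shape drawn in Fig. 2.6, and the name `tree_combine` is an assumption of this example.

```python
from operator import xor


def tree_combine(values, op):
    """Combine values with the associative operator op, pairing neighbors
    level by level. Returns (result, depth), where depth counts the levels
    of two-input gates that were used."""
    level = list(values)
    depth = 0
    while len(level) > 1:
        # Pair adjacent elements in order, so the left-to-right order of the
        # operands is preserved (op need not be commutative).
        nxt = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # an unpaired last element passes through
        level = nxt
        depth += 1
    return level[0], depth


if __name__ == "__main__":
    print(tree_combine([1, 0, 1, 1, 0], xor))
    print(tree_combine(list("abcde"), lambda a, b: a + b))
```

Each pass roughly halves the number of operands, which is why the number of levels grows only logarithmically in n while the number of two-input operations remains n - 1.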

2.5.2 Shifting Functions 

Shifting functions can be used to multiply integers and generally manipulate data. A cyclic 
shifting function rotates the bits in a word. For example, the left cyclic shift of the 4-tuple 
(1, 0, 0, 0) by three places produces the 4-tuple (0, 1, 0, 0). 

The cyclic shifting function $f^{(n)}_{\mathrm{cyclic}} : \mathcal{B}^{n + \lceil \log_2 n \rceil} \mapsto \mathcal{B}^n$ takes as input an n-tuple
$x = (x_{n-1}, \ldots, x_1, x_0)$ and cyclically shifts it left by $|s|$ places, where $|s|$ is the integer associated
with the binary k-tuple $s = (s_{k-1}, \ldots, s_1, s_0)$, $k = \lceil \log_2 n \rceil$, and

$$|s| = \sum_{j=0}^{k-1} s_j 2^j$$

The n-tuple that results from the shift is $y = (y_{n-1}, \ldots, y_1, y_0)$, denoted as follows:

$$y = f^{(n)}_{\mathrm{cyclic}}(x, s)$$

Figure 2.7 Three stages of a cyclic shifting circuit on eight inputs.



A convenient way to perform the cyclic shift of x by $|s|$ places is to represent $|s|$ as a sum
of powers of 2, as shown above, and for each $0 \le j \le k - 1$, shift x left cyclically by $s_j 2^j$
places, that is, by either 0 or $2^j$ places depending on whether $s_j = 0$ or 1. For example,
consider cyclically shifting the 8-tuple u = (1, 0, 1, 1, 0, 1, 0, 1) by seven places. Since 7 is
represented by the binary number (1, 1, 1), that is, 7 = 4 + 2 + 1, to shift (1, 0, 1, 1, 0, 1, 0, 1)
by seven places it suffices to shift it by one place, by two places, and then by four places. (See
Fig. 2.7.)

For $0 \le r \le n - 1$, the following formula defines the value of the rth output, $y_r$, of a
circuit on n inputs that shifts its input x left cyclically by either 0 or $2^j$ places depending on
whether $s_j = 0$ or 1:

$$y_r = (x_r \wedge \overline{s}_j) \vee (x_{(r - 2^j) \bmod n} \wedge s_j)$$

Thus, $y_r$ is $x_r$ in the first case or $x_{(r - 2^j) \bmod n}$ in the second. The subscript $(r - 2^j) \bmod n$
is the nonnegative remainder of $(r - 2^j)$ after division by n. For example, if n = 4, r = 1, and
j = 1, then $(r - 2^j) = -1$, which is 3 modulo 4. That is, in a circuit that shifts by either 0
or $2^1$ places, $y_1$ is either $x_1$ or $x_3$ because $x_3$ moves into the second position when shifted left
cyclically by two places.

A circuit based on the above formula that shifts by either 0 or $2^j$ places depending on 
whether $s_j = 0$ or 1 is shown in Fig. 2.8 for n = 4. The circuit on n inputs has 3n + 1 gates 
and depth 3. 
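
The stage formula can be exercised directly in software. The following Python sketch is illustrative only: the function names `shift_stage` and `cyclic_shift` and the list layout (entry `x[r]` holds $x_r$, entry `s[j]` holds $s_j$) are assumptions made for this example, not notation from the text.

```python
def shift_stage(x, j, s_j):
    """One stage: y_r = (x_r AND NOT s_j) OR (x_{(r - 2^j) mod n} AND s_j)."""
    n = len(x)
    not_s = 1 - s_j
    # Python's % returns a nonnegative remainder, matching the subscript rule.
    return [(x[r] & not_s) | (x[(r - 2 ** j) % n] & s_j) for r in range(n)]


def cyclic_shift(x, s):
    """Shift x left cyclically by |s| = sum_j s[j] * 2^j, one stage per bit of s."""
    for j, s_j in enumerate(s):
        x = shift_stage(x, j, s_j)
    return x


def as_tuple(x):
    """Write the list high-order component first, as tuples are written in the text."""
    return tuple(reversed(x))


# The 8-tuple u = (1, 0, 1, 1, 0, 1, 0, 1), stored so that u[r] holds u_r.
u = list(reversed((1, 0, 1, 1, 0, 1, 0, 1)))
print(as_tuple(cyclic_shift(u, [1, 1, 1])))  # shift by 7 = 1 + 2 + 4 places
```

Each call to `shift_stage` corresponds to one stage of the kind drawn in Fig. 2.8, and `cyclic_shift` chains the stages as Fig. 2.7 suggests.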

It follows that a circuit for cyclic shifting an n-tuple can be realized in $k = \lceil \log_2 n \rceil$ stages 
each of which has 3n + 1 gates and depth 3, as suggested by Fig. 2.7. Since this may be neither 
the smallest nor the shallowest circuit that computes $f^{(n)}_{\text{cyclic}} : \mathcal{B}^{n+\lceil \log_2 n \rceil} \mapsto \mathcal{B}^n$, its minimal circuit 
size and depth satisfy the following bounds. 

Figure 2.8 One stage of a circuit for cyclic shifting four inputs by 0 or 2 places depending on 
whether $s_1 = 0$ or 1. 



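LEMMA 2.5.1 The cyclic shifting function $f^{(n)}_{\text{cyclic}} : \mathcal{B}^{n+\lceil \log_2 n \rceil} \mapsto \mathcal{B}^n$ can be realized by a 
circuit of the following size and depth over the basis $\Omega_0 = \{\wedge, \vee, \neg\}$: 

$$C_{\Omega_0}\left(f^{(n)}_{\text{cyclic}}\right) \le (3n + 1) \lceil \log_2 n \rceil$$

$$D_{\Omega_0}\left(f^{(n)}_{\text{cyclic}}\right) \le 3 \lceil \log_2 n \rceil$$

The logical shifting function $f^{(n)}_{\text{shift}} : \mathcal{B}^{n+\lceil \log_2 n \rceil} \mapsto \mathcal{B}^n$ shifts left the n-tuple $x$ by 
a number of places specified by a binary $\lceil \log_2 n \rceil$-tuple $s$, discarding the higher-index com- 
ponents, and filling in the lower-indexed vacated places with 0's to produce the n-tuple $y$, 
where 

$$y_j = \begin{cases} x_{j - |s|} & \text{for } |s| \le j \le n - 1 \\ 0 & \text{otherwise} \end{cases}$$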



REDUCTIONS BETWEEN LOGICAL AND CYCLIC SHIFTING The logical shifting function $f^{(n)}_{\text{shift}} : 
\mathcal{B}^{n+\lceil \log_2 n \rceil} \mapsto \mathcal{B}^n$ on the n-tuple $x$ is defined below in terms of $f^{(2n)}_{\text{cyclic}}$ and the "projection" 
function $\pi^{(n)}_L : \mathcal{B}^{2n} \mapsto \mathcal{B}^n$ that deletes the n high order components from its input 2n-tuple. 
Here $0$ denotes the zero binary n-tuple and $0 \cdot x$ denotes the concatenation of the two strings. 
(See Figs. 2.9 and 2.10.) 

$$f^{(n)}_{\text{shift}}(x, s) = \pi^{(n)}_L\left(f^{(2n)}_{\text{cyclic}}(0 \cdot x, s)\right)$$

Figure 2.9 The reduction of $f_{\text{shift}}$ to $f_{\text{cyclic}}$ obtained by cyclically shifting $0 \cdot x$ by three places 
and projecting out the shaded components. 



Figure 2.10 The function $f^{(n)}_{\text{cyclic}}$ is obtained by computing $f^{(2n)}_{\text{shift}}$ on $xx$ and truncating the n 
low-order bits. 

LEMMA 2.5.2 The function $f^{(2n)}_{\text{cyclic}}$ contains $f^{(n)}_{\text{shift}}$ as a subfunction and the function $f^{(2n)}_{\text{shift}}$ con- 
tains $f^{(n)}_{\text{cyclic}}$ as a subfunction. 

Proof The first statement follows from the above argument concerning $f^{(n)}_{\text{shift}}$. The second 
statement follows by noting that 

$$f^{(n)}_{\text{cyclic}}(x, s) = \pi^{(n)}_H\left(f^{(2n)}_{\text{shift}}(x \cdot x, s)\right)$$

where $\pi^{(n)}_H$ deletes the n low-order components of its input. ■ 

This relationship between logical and cyclic shifting functions clearly holds for variants 
of such functions in which the amount of a shift is specified with some other notation. An 
example of such a shifting function is integer multiplication in which one of the two arguments 
is a power of 2. 
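
Both reductions are easy to check on small inputs. The sketch below is a hedged illustration in Python: tuples are written high-order component first as in the text, the shift amount is passed as an integer rather than as the binary tuple $s$, and all function names are invented for the example.

```python
def cyclic_shift(x, places):
    """Left cyclic shift of x = (x_{n-1}, ..., x_0)."""
    places %= len(x)
    return x[places:] + x[:places]


def logical_shift(x, places):
    """Left logical shift: high-order components are discarded, zeros fill the low end."""
    places = min(places, len(x))
    return x[places:] + (0,) * places


def logical_via_cyclic(x, places):
    """Cyclically shift 0 . x, then delete the n high-order components (pi_L)."""
    n = len(x)
    return cyclic_shift((0,) * n + x, places)[n:]


def cyclic_via_logical(x, places):
    """Logically shift x . x, then delete the n low-order components (pi_H)."""
    n = len(x)
    return logical_shift(x + x, places)[:n]


x = (1, 0, 1, 1, 0, 1, 0, 1)
for places in range(len(x)):
    assert logical_via_cyclic(x, places) == logical_shift(x, places)
    assert cyclic_via_logical(x, places) == cyclic_shift(x, places)
```

The loop covers every shift amount from 0 to n - 1, which is the range a $\lceil \log_2 n \rceil$-bit shift amount can address when n is a power of 2.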



2.5.3 Encoder 



The encoder function $f^{(n)}_{\text{encode}} : \mathcal{B}^{2^n} \mapsto \mathcal{B}^n$ has $2^n$ inputs, exactly one of which is 1. Its 
output is an n-tuple that is a binary number representing the position of the input that has 
value 1. That is, it encodes the position of the input bit that has value 1. Encoders are used in 
CPUs to identify the source of external interrupts. 

Let $x = (x_{2^n - 1}, \ldots, x_2, x_1, x_0)$ represent the $2^n$ inputs and let $y = (y_{n-1}, \ldots, y_1, y_0)$ 
represent the n outputs. Then, we write $f^{(n)}_{\text{encode}}(x) = y$. 

When n = 1, the encoder function has two inputs, $x_1$ and $x_0$, and one output, $y_0$, whose 
value is $y_0 = x_1$ because if $x_0 = 1$, then $x_1 = 0$ and $y_0 = 0$ is the binary representation of 
the input whose value is 1. Similar reasoning applies when $x_0 = 0$. 

When $n \ge 2$, we observe that the high-order output bit, $y_{n-1}$, has value 1 if 1 falls among 
the variables $x_{2^n - 1}, \ldots, x_{2^{n-1}+1}, x_{2^{n-1}}$. Otherwise, $y_{n-1} = 0$. Thus, $y_{n-1}$ can be computed 
as the OR of these variables, as suggested for the encoder on eight inputs in Fig. 2.11. 

The remaining n - 1 output bits, $y_{n-2}, \ldots, y_1, y_0$, represent the position of the 1 among 
variables $x_{2^{n-1}-1}, \ldots, x_2, x_1, x_0$ if $y_{n-1} = 0$ or the 1 among variables $x_{2^n - 1}, \ldots, x_{2^{n-1}+1}, 
x_{2^{n-1}}$ if $y_{n-1} = 1$. For example, for n = 3 if x = (0, 0, 0, 0, 0, 0, 1, 0), then $y_2 = 0$ and 
$(y_1, y_0) = (0, 1)$, whereas if x = (0, 0, 1, 0, 0, 0, 0, 0), then $y_2 = 1$ and $(y_1, y_0) = (0, 1)$. 
Thus, after computing $y_{n-1}$ as the OR of the $2^{n-1}$ high-order inputs, the remaining output 
bits can be obtained by supplying to an encoder on $2^{n-1}$ inputs the $2^{n-1}$ low-order bits if 
$y_{n-1} = 0$ or the $2^{n-1}$ high-order bits if $y_{n-1} = 1$. It follows that in both cases we can 
supply the vector $\delta = (x_{2^n - 1} \vee x_{2^{n-1} - 1}, x_{2^n - 2} \vee x_{2^{n-1} - 2}, \ldots, x_{2^{n-1}} \vee x_0)$ of $2^{n-1}$ 
components to the encoder on $2^{n-1}$ inputs. This is illustrated in Fig. 2.11. 

Figure 2.11 The recursive construction of an encoder circuit on eight inputs. 

Let's now derive upper bounds on the size and depth of the optimal circuit for $f^{(n)}_{\text{encode}}$. 
Clearly $C_{\Omega_0}\left(f^{(1)}_{\text{encode}}\right) = 0$ and $D_{\Omega_0}\left(f^{(1)}_{\text{encode}}\right) = 0$, since no gates are needed in this case. 
From the construction described above and illustrated in Fig. 2.11, we see that we can construct 
a circuit for $f^{(n)}_{\text{encode}}$ in a two-step process. First, we form $y_{n-1}$ as the OR of the $2^{n-1}$ high- 
order variables in a balanced OR tree of depth n - 1 using $2^{n-1} - 1$ OR's. Second, we form 
the vector $\delta$ with a circuit of depth 1 using $2^{n-1}$ OR's and supply it to a copy of a circuit 
for $f^{(n-1)}_{\text{encode}}$. This provides the following recurrences for the circuit size and depth of $f^{(n)}_{\text{encode}}$ 
because the depth of this circuit is no more than the maximum of the depth of the OR tree and 
1 more than the depth of a circuit for $f^{(n-1)}_{\text{encode}}$: 

$$C_{\Omega_0}\left(f^{(n)}_{\text{encode}}\right) \le 2^n - 1 + C_{\Omega_0}\left(f^{(n-1)}_{\text{encode}}\right) \qquad (2.4)$$

$$D_{\Omega_0}\left(f^{(n)}_{\text{encode}}\right) \le \max\left(n - 1,\; D_{\Omega_0}\left(f^{(n-1)}_{\text{encode}}\right) + 1\right) \qquad (2.5)$$

The solutions to these recurrences are stated as the following lemma, as the reader can show. 
(See Problem 2.14.) 
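
As a numerical check on recurrence (2.4), the recursive construction can be simulated while counting OR gates. This Python sketch is an illustration under stated assumptions: the function name `encode` and the list layout (entry `x[i]` holds $x_i$, entry `y[i]` holds $y_i$) are choices made for the example.

```python
def encode(x):
    """Recursive encoder on 2^n inputs, exactly one of which is 1.

    Returns (y, gates): the output bits and the number of OR gates used.
    """
    m = len(x)
    if m == 2:
        return [x[1]], 0                      # n = 1: y_0 = x_1, no gates
    half = m // 2
    low, high = x[:half], x[half:]
    y_top = int(any(high))                    # OR tree over the high half: half - 1 gates
    delta = [a | b for a, b in zip(low, high)]  # the vector delta: half gates
    y_rest, gates = encode(delta)
    return y_rest + [y_top], gates + (half - 1) + half


for n in range(1, 6):
    for position in range(2 ** n):
        x = [0] * (2 ** n)
        x[position] = 1
        y, gates = encode(x)
        assert sum(bit << i for i, bit in enumerate(y)) == position
        assert gates == 2 ** (n + 1) - (n + 3)
```

The second assertion compares the counted gates with the closed form $2^{n+1} - (n + 3)$ obtained by solving (2.4) with zero gates at n = 1.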



LEMMA 2.5.3 The encoder function $f^{(n)}_{\text{encode}}$ has the following circuit size and depth bounds: 

$$C_{\Omega_0}\left(f^{(n)}_{\text{encode}}\right) \le 2^{n+1} - (n + 3)$$

$$D_{\Omega_0}\left(f^{(n)}_{\text{encode}}\right) \le n - 1$$

2.5.4 Decoder 

A decoder is a function that reverses the operation of an encoder: given an n-bit binary address, 
it generates $2^n$ bits with a single 1 in the position specified by the binary number. Decoders 
are used in the design of random-access memory units (see Section 3.5) and of the multiplexer 
(see Section 2.5.5). 



The decoder function $f^{(n)}_{\text{decode}} : \mathcal{B}^n \mapsto \mathcal{B}^{2^n}$ has n input variables $x = (x_{n-1}, \ldots, x_1, x_0)$ 
and $2^n$ output variables $y = (y_{2^n - 1}, \ldots, y_1, y_0)$; that is, $f^{(n)}_{\text{decode}}(x) = y$. Let $c$ be a binary 
n-tuple corresponding to the integer $|c|$. All components of the binary $2^n$-tuple $y$ are zero 
except for the one whose index is $|c|$, namely $y_{|c|}$. Thus, the minterm functions in the variables 
$x$ are computed as the output of $f^{(n)}_{\text{decode}}$. 

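A direct realization of the function $f^{(n)}_{\text{decode}}$ can be obtained by realizing each minterm 
independently. This circuit uses $(2n - 1)2^n$ gates and has depth $\lceil \log_2 n \rceil + 1$. Thus we have 
the following bounds over the basis $\Omega_0 = \{\wedge, \vee, \neg\}$: 

$$C_{\Omega_0}\left(f^{(n)}_{\text{decode}}\right) \le (2n - 1)2^n$$

$$D_{\Omega_0}\left(f^{(n)}_{\text{decode}}\right) \le \lceil \log_2 n \rceil + 1$$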

A smaller upper bound on circuit size and depth can be obtained from the recursive con- 
struction of Fig. 2.12, which is based on the observation that a minterm on n variables is the 
AND of a minterm on the first n/2 variables and a minterm on the second n/2 variables. For 
example, when n = 4, the minterm $x_3 \wedge x_2 \wedge x_1 \wedge x_0$ is obviously equal to the AND of the 
minterm $x_3 \wedge x_2$ in the variables $x_3$ and $x_2$ and the minterm $x_1 \wedge x_0$ in the variables $x_1$ and $x_0$. 
Thus, when n is even, the minterms that are the outputs of $f^{(n)}_{\text{decode}}$ can be formed by ANDing 
every minterm generated by a circuit for $f^{(n/2)}_{\text{decode}}$ on the variables $x_{n/2 - 1}, \ldots, x_0$ with every 
minterm generated by a circuit for $f^{(n/2)}_{\text{decode}}$ on the variables $x_{n-1}, \ldots, x_{n/2}$, as suggested in 
Fig. 2.12. 

Figure 2.12 The construction of a decoder on four inputs from two copies of a decoder on two 
inputs. 

The new circuit for $f^{(n)}_{\text{decode}}$ has a size that is at most twice that of a circuit for $f^{(n/2)}_{\text{decode}}$ 
plus $2^n$ for the AND gates that combine minterms. It has a depth that is at most 1 more than 
the depth of a circuit for $f^{(n/2)}_{\text{decode}}$. Thus, when n is even we have the following bounds on the 
circuit size and depth of $f^{(n)}_{\text{decode}}$: 

$$C_{\Omega_0}\left(f^{(n)}_{\text{decode}}\right) \le 2\, C_{\Omega_0}\left(f^{(n/2)}_{\text{decode}}\right) + 2^n$$

$$D_{\Omega_0}\left(f^{(n)}_{\text{decode}}\right) \le D_{\Omega_0}\left(f^{(n/2)}_{\text{decode}}\right) + 1$$

Specializing the first bounds given above on the size and depth of a decoder circuit to one on 
n/2 inputs, we have the bound in Lemma 2.5.4. Furthermore, since the output functions are 
all different, $C_{\Omega_0}\left(f^{(n)}_{\text{decode}}\right)$ is at least $2^n$. 

LEMMA 2.5.4 For n even the decoder function $f^{(n)}_{\text{decode}}$ has the following circuit size and depth 
bounds: 

$$2^n \le C_{\Omega_0}\left(f^{(n)}_{\text{decode}}\right) \le 2^n + (2n - 2)2^{n/2}$$

$$D_{\Omega_0}\left(f^{(n)}_{\text{decode}}\right) \le \lceil \log_2 n \rceil + 1$$

The circuit size bound is linear in the number of outputs. Also, for $n \ge 12$, the exact value of 
$C_{\Omega_0}\left(f^{(n)}_{\text{decode}}\right)$ is known to within 25%. Since each output depends on n inputs, we will see 
in Chapter 9 that the upper bound on depth is exactly the depth of the smallest depth circuit 
for the decoder function. 
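
The two constructions can be compared in a short simulation. The Python sketch below is illustrative: `decode_direct` and `decode_split` are invented names, entry `x[i]` holds $x_i$, and the gate counts are the worst-case figures quoted in the text rather than counts of gates actually instantiated.

```python
def decode_direct(x):
    """Realize every minterm independently; returns (y, gates) with y[c] = 1 iff |x| = c."""
    n = len(x)
    y = []
    for c in range(2 ** n):
        term = 1
        for i in range(n):
            literal = x[i] if (c >> i) & 1 else 1 - x[i]
            term &= literal
        y.append(term)
    return y, (2 * n - 1) * 2 ** n            # size bound of the direct realization


def decode_split(x):
    """For even n, AND the minterms of two direct decoders on n/2 inputs each."""
    n = len(x)
    assert n % 2 == 0
    h = n // 2
    low, gates_low = decode_direct(x[:h])     # minterms in x_{n/2-1}, ..., x_0
    high, gates_high = decode_direct(x[h:])   # minterms in x_{n-1}, ..., x_{n/2}
    mask = 2 ** h - 1
    y = [high[c >> h] & low[c & mask] for c in range(2 ** n)]
    return y, gates_low + gates_high + 2 ** n


n = 4
for address in range(2 ** n):
    x = [(address >> i) & 1 for i in range(n)]
    y, gates = decode_split(x)
    assert y == [int(c == address) for c in range(2 ** n)]
    assert gates == 2 ** n + (2 * n - 2) * 2 ** (n // 2)
```

For n = 4 the split construction is charged 40 gates against 112 for the direct one, and the gap widens as n grows.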



2.5.5 Multiplexer 

The multiplexer function $f^{(n)}_{\text{mux}} : \mathcal{B}^{2^n + n} \mapsto \mathcal{B}$ has two vector inputs, $z = (z_{2^n - 1}, \ldots, z_1, 
z_0)$ and $x = (x_{n-1}, \ldots, x_1, x_0)$, where $x$ is treated as an address. The output of $f^{(n)}_{\text{mux}}$ is 
$v = z_j$, where $j = |x|$ is the integer represented by the binary number $x$. This function is 
also known as the storage access function because it simulates the access to storage made by a 
random-access memory with one-bit words. (See Section 3.5.) 

The similarity between this function and the decoder function should be apparent. The 
decoder function has n inputs, $x = (x_{n-1}, \ldots, x_1, x_0)$, and $2^n$ outputs, $y = (y_{2^n - 1}, \ldots, y_1, 
y_0)$, where $y_j = 1$ if $j = |x|$ and $y_j = 0$ otherwise. Thus, we can form $v = z_j$ as 

$$v = (z_{2^n - 1} \wedge y_{2^n - 1}) \vee \cdots \vee (z_1 \wedge y_1) \vee (z_0 \wedge y_0)$$

This circuit uses a circuit for the decoder function $f^{(n)}_{\text{decode}}$ plus $2^n$ AND gates and $2^n - 1$ 
OR gates. It adds a depth of n + 1 to the depth of a decoder circuit. Lemma 2.5.5 follows 
immediately from these observations. 




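LEMMA 2.5.5 The multiplexer function $f^{(n)}_{\text{mux}} : \mathcal{B}^{2^n + n} \mapsto \mathcal{B}$ can be realized with the following 
circuit size and depth over the basis $\Omega_0 = \{\wedge, \vee, \neg\}$: 

$$C_{\Omega_0}\left(f^{(n)}_{\text{mux}}\right) \le 3 \cdot 2^n + 2(n - 1)2^{n/2} - 1$$

$$D_{\Omega_0}\left(f^{(n)}_{\text{mux}}\right) \le n + \lceil \log_2 n \rceil + 2$$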



Using the lower bound of Theorem 9.3.3, one can show that it is impossible to reduce 
the upper bound on circuit size to less than $2^{n+1} - 2$. At the cost of increasing the depth by 
1, the circuit size bound can be improved to about $2^{n+1}$. (See Problem 2.15.) Since $f^{(n)}_{\text{mux}}$ 
depends on $2^n + n$ variables, we see from Theorem 9.3.1 that it must have depth at least 
$\log_2(2^n + n) \ge n$. Thus, the above depth bound is very tight. 
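
The decoder-based multiplexer can be sketched in a few lines of Python. This is an illustrative model, not a gate-level netlist: `decode` and `mux` are invented names, and entry `z[j]` holds $z_j$ while entry `x[i]` holds $x_i$.

```python
def decode(x):
    """Return y with y[c] = 1 exactly when c = |x|, built one address bit at a time."""
    y = [1]
    for bit in x:                             # x[0] is the low-order address bit
        y = [m & (1 - bit) for m in y] + [m & bit for m in y]
    return y


def mux(z, x):
    """v = (z_{2^n-1} AND y_{2^n-1}) OR ... OR (z_0 AND y_0), with y = decode(x)."""
    v = 0
    for z_j, y_j in zip(z, decode(x)):
        v |= z_j & y_j
    return v


z = [0, 1, 1, 0, 1, 0, 0, 1]                  # z[j] holds z_j
for address in range(8):
    x = [(address >> i) & 1 for i in range(3)]
    assert mux(z, x) == z[address]
```

The loop in `mux` uses one AND per data input and ORs the results together, which mirrors the $2^n$ AND gates and $2^n - 1$ OR gates counted in the text.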

2.5.6 Demultiplexer 

The demultiplexer function $f^{(n)}_{\text{demux}} : \mathcal{B}^{n+1} \mapsto \mathcal{B}^{2^n}$ is very similar to a decoder. It has n + 1 
inputs consisting of n bits, $x$, that serve as an address and a data bit $e$. It has $2^n$ outputs $y$ all 
of which are 0 if e = 0 and one output that is 1 if e = 1, namely the output specified by the 
n address bits. Demultiplexers are used to route a data bit ($e$) to one of $2^n$ output positions. 

A circuit for the demultiplexer function can be constructed as follows. First, form the AND 
of $e$ with each of the n address bits $x_{n-1}, \ldots, x_1, x_0$ and supply this new n-tuple as input to 
a decoder circuit. Let $z = (z_{2^n - 1}, \ldots, z_1, z_0)$ be the decoder outputs. When e = 0, each of 
the decoder inputs is 0 and each of the decoder outputs except $z_0$ is 0 and $z_0 = 1$. If we form 
the AND of $z_0$ with $e$, this new output is also 0 when e = 0. If e = 1, the decoder input is the 
address $x$ and the output that is 1 is in the position specified by this address. Thus, a circuit 
for a demultiplexer can be constructed from a circuit for $f^{(n)}_{\text{decode}}$ to which are added n AND 
gates on its input and one on its output. This circuit has a depth that is at most 2 more than 
the depth of the decoder circuit. Since a circuit for a decoder can be constructed from one 
for a demultiplexer by fixing e = 1, we have the following bounds on the size and depth of a 
circuit for $f^{(n)}_{\text{demux}}$. 

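LEMMA 2.5.6 The demultiplexer function $f^{(n)}_{\text{demux}} : \mathcal{B}^{n+1} \mapsto \mathcal{B}^{2^n}$ can be realized with the 
following circuit size and depth over the basis $\Omega_0 = \{\wedge, \vee, \neg\}$: 

$$0 \le C_{\Omega_0}\left(f^{(n)}_{\text{demux}}\right) - C_{\Omega_0}\left(f^{(n)}_{\text{decode}}\right) \le n + 1$$

$$0 \le D_{\Omega_0}\left(f^{(n)}_{\text{demux}}\right) - D_{\Omega_0}\left(f^{(n)}_{\text{decode}}\right) \le 2$$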



2.6 Prefix Computations 



The prefix computation first appeared in the design of logic circuits, the goal being to paral- 
lelize as much as possible circuits for integer addition and multiplication. The carry-lookahead 
adder is a fast circuit for integer addition that is based on a prefix computation. (See Sec- 
tion 2.7.) Prefix computations are now widely used in parallel computation because they 
provide a standard, optimizable framework in which to perform computations in parallel. 

The prefix function $\mathcal{P}^{(n)}_{\odot} : A^n \mapsto A^n$ on input $x = (x_1, x_2, \ldots, x_n)$ produces as 
output $y = (y_1, y_2, \ldots, y_n)$, which is a running sum of its n inputs $x$ using the operator 
$\odot$ as the summing operator. That is, $y_j = x_1 \odot x_2 \odot \cdots \odot x_j$ for $1 \le j \le n$. Thus, if 
the set $A$ is $\mathbb{N}$, the natural numbers, and $\odot$ is the integer addition operator +, then $\mathcal{P}^{(n)}_{+}$ 
on the input $x = (x_1, x_2, \ldots, x_n)$ produces the output $y$, where $y_1 = x_1$, $y_2 = x_1 + x_2$, 
$y_3 = x_1 + x_2 + x_3$, ..., $y_n = x_1 + x_2 + \cdots + x_n$. For example, shown below is the prefix 
function on a 6-vector of integers under integer addition. 

$$x = (2, 1, 3, 7, 5, 1)$$
$$\mathcal{P}^{(6)}_{+}(x) = (2, 3, 6, 13, 18, 19)$$

A prefix function is defined only for operators $\odot$ that are associative over the set $A$. An 
operator over $A$ is associative if a) for all $a$ and $b$ in $A$, $a \odot b$ is in $A$, and b) for all $a$, $b$, and 
$c$ in $A$, $(a \odot b) \odot c = a \odot (b \odot c)$, that is, if all groupings of terms in a sum with the operator 
$\odot$ have the same value. A pair $(A, \odot)$ in which $\odot$ is associative is called a semigroup. Three 
semigroups on which a prefix function can be defined are 

• $(\mathbb{N}, +)$ where $\mathbb{N}$ are the natural numbers and + is integer addition. 

• $(\{0, 1\}^*, \cdot)$ where $\{0, 1\}^*$ is the set of binary strings and $\cdot$ is string concatenation. 

• $(A, \odot_{\text{copy}})$ where $A$ is a set and $\odot_{\text{copy}}$ is defined by $a \odot_{\text{copy}} b = a$. 

It is easy to show that the concatenation operator $\cdot$ on $\{0, 1\}^*$ and $\odot_{\text{copy}}$ on a set $A$ are 
associative. (See Problem 2.20.) Another important semigroup is the set of matrices under 
matrix multiplication (see Theorem 6.2.1). 

Summarizing, if $(A, \odot)$ is a semigroup, the prefix function $\mathcal{P}^{(n)}_{\odot} : A^n \mapsto A^n$ on input 
$x = (x_1, x_2, \ldots, x_n)$ produces as output $y = (y_1, y_2, \ldots, y_n)$, where $y_j = x_1 \odot x_2 \odot \cdots \odot x_j$ 
for $1 \le j \le n$. 
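
A sequential reference implementation makes the definition concrete. The sketch below uses `itertools.accumulate` from the Python standard library; the wrapper name `prefix` is an assumption made for this example, and the three calls correspond to the three semigroups listed above.

```python
from itertools import accumulate
import operator


def prefix(xs, op):
    """Return [x1, x1 op x2, ..., x1 op ... op xn] for an associative operator op."""
    return list(accumulate(xs, op))


# (N, +): integer addition
print(prefix([2, 1, 3, 7, 5, 1], operator.add))     # [2, 3, 6, 13, 18, 19]

# ({0,1}*, .): string concatenation
print(prefix(["0", "1", "10"], operator.add))       # ['0', '01', '0110']

# (A, copy): a copy b = a
print(prefix(["a", "b", "c"], lambda a, b: a))      # ['a', 'a', 'a']
```

This version takes n - 1 sequential steps; it serves only as a specification against which faster circuits can be checked.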

Load balancing on a parallel machine is an important application of prefix computation. 
A simple example of load balancing is the following: We assume that p processors, numbered 
from 0 to p - 1, are running processes in parallel. We also assume that processes are born 
and die, resulting in a possible imbalance in the number of processes active on processors. 
Since it is desirable that all processors be running the same number of processes, processes 
are periodically redistributed among processors to balance the load. To rebalance the load, a) 
processors are given a linear order, and b) each process is assigned a Boolean variable with value 
1 if it is alive and 0 otherwise. Each processor computes its number of living processes, $n_i$. A 
prefix computation is then done on these values using the linear order among processors. This 
computation provides the jth processor with the sum $n_j + n_{j-1} + \cdots + n_1$ which it uses to 
give each of its living processes a unique index. The sum $n = n_p + \cdots + n_1$ is then broadcast 
to all processors. When the processors are in balance all have $\lceil n/p \rceil$ processes except possibly 
one that has fewer processes. Assigning the sth process to processor (s mod p) ensures that 
the load is balanced. 
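
The rebalancing step can be sketched as follows. This Python fragment is a simplified model under stated assumptions: `rebalance` is an invented name, processors are indexed from 0, process indices start at 0, and the function returns only the resulting process counts rather than moving any real processes.

```python
from itertools import accumulate


def rebalance(live_counts):
    """live_counts[j] is the number of living processes on processor j.

    Returns the per-processor counts after process s is sent to processor s mod p.
    """
    p = len(live_counts)
    running = list(accumulate(live_counts))    # running[j] = n_0 + ... + n_j
    new_counts = [0] * p
    for j, n_j in enumerate(live_counts):
        first_index = running[j] - n_j         # first unique index owned by processor j
        for s in range(first_index, running[j]):
            new_counts[s % p] += 1
    return new_counts


print(rebalance([5, 0, 1, 2]))
print(rebalance([4, 0, 1, 2]))
```

Because the indices 0, 1, ..., n - 1 are dealt out round-robin, the resulting counts differ by at most one.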

Another important type of prefix computation is the segmented prefix computation. In 
this case two n-vectors are given, a value vector $x$ and a flag vector $\phi$. The value of the ith 
entry $y_i$ in the result vector $y$ is $x_i$ if $\phi_i$ is 1 and otherwise is the associative combination with 
$\odot$ of $x_i$ and the values between it and the first value $x_j$ to the left of $x_i$ for which the flag 
$\phi_j = 1$. The first bit of $\phi$ is always 1. An example of a segmented prefix computation is shown 
below for integer values and integer addition as the associative operation: 

$$x = (2, 1, 3, 7, 5, 1)$$
$$\phi = (1, 0, 0, 1, 0, 1)$$
$$y = (2, 3, 6, 7, 12, 1)$$

As shown in Problem 2.21, a segmented prefix computation is a special case of a general prefix 
computation. This is demonstrated by defining a new associative operation on value-flag 
pairs that returns another value-flag pair. 
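
One common choice of operator on value-flag pairs is sketched below in Python. It is an assumption made for illustration and is not necessarily the operator intended in Problem 2.21; the names `segmented_op` and `segmented_prefix` are likewise invented.

```python
from itertools import accumulate
import operator


def segmented_op(op):
    """Lift an associative op on values to an associative op on (value, flag) pairs."""
    def combine(left, right):
        (v1, f1), (v2, f2) = left, right
        if f2:                       # right operand starts a new segment
            return (v2, 1)
        return (op(v1, v2), f1)      # otherwise extend the segment on the left
    return combine


def segmented_prefix(values, flags, op):
    """Ordinary prefix computation over pairs, keeping only the value components."""
    pairs = accumulate(zip(values, flags), segmented_op(op))
    return [value for value, _ in pairs]


x = [2, 1, 3, 7, 5, 1]
phi = [1, 0, 0, 1, 0, 1]
print(segmented_prefix(x, phi, operator.add))   # [2, 3, 6, 7, 12, 1]
```

Since `combine` is associative, any prefix circuit for a general semigroup can evaluate it, not only the sequential `accumulate` used here.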

2.6.1 An Efficient Parallel Prefix Circuit 

A circuit for the prefix function $\mathcal{P}^{(n)}_{\odot}$ can be realized with $O(n^2)$ instances of $\odot$ if for each 
$1 \le j \le n$ we naively realize $y_j = x_1 \odot x_2 \odot \cdots \odot x_j$ with a separate circuit containing j - 1 
instances of $\odot$. If each such circuit is organized as a balanced binary tree, the depth of the 
circuit for $\mathcal{P}^{(n)}_{\odot}$ is the depth of the circuit for $y_n$, which is $\lceil \log_2 n \rceil$. This is a parallel circuit 
for the prefix problem but uses many more operators than necessary. We now describe a much 
more efficient circuit for this problem; it uses $O(n)$ instances of $\odot$ and has depth $O(\log n)$. 

To describe this improved circuit, we let $x[r, r] = x_r$ and for $r < s$ let $x[r, s] = x_r \odot 
x_{r+1} \odot \cdots \odot x_s$. Then we can write $\mathcal{P}^{(n)}_{\odot}(x) = y$ where $y_j = x[1, j]$. 

Because $\odot$ is associative, we observe that $x[r, s] = x[r, t] \odot x[t + 1, s]$ for $r \le t < s$. 
We use this fact to construct the improved circuit. Let $n = 2^k$. Observe that if we form the 
(n/2)-tuple $(x[1, 2], x[3, 4], x[5, 6], \ldots, x[2^k - 1, 2^k])$ using the rule $x[i, i + 1] = x[i, i] \odot 
x[i + 1, i + 1]$ for i odd and then do a prefix computation on it, we obtain the (n/2)-tuple 
$(x[1, 2], x[1, 4], x[1, 6], \ldots, x[1, 2^k])$. This is almost what is needed. We must only compute 
$x[1, 1], x[1, 3], x[1, 5], \ldots, x[1, 2^k - 1]$, which is easily done using the rule $x[1, 2i + 1] = 
x[1, 2i] \odot x_{2i+1}$ for $1 \le i \le 2^{k-1} - 1$. (See Fig. 2.13.) The base case for this construction is 
that of n = 1, for which $y_1 = x_1$ and no operations are needed. 

If $C(k)$ is the size of this circuit on $n = 2^k$ inputs and $D(k)$ is its depth, then $C(0) = 0$, 
$D(0) = 0$ and $C(k)$ and $D(k)$ for $k \ge 1$ satisfy the following recurrences: 

$$C(k) = C(k - 1) + 2^k - 1$$
$$D(k) = D(k - 1) + 2$$

As a consequence, we have the following result. 

THEOREM 2.6.1 For $n = 2^k$, k an integer, the parallel prefix function $\mathcal{P}^{(n)}_{\odot} : A^n \mapsto A^n$ on an 
n-vector with associative operator $\odot$ can be implemented by a circuit with the following size and 
depth bounds over the basis $\Omega = \{\odot\}$: 

$$C_{\Omega}\left(\mathcal{P}^{(n)}_{\odot}\right) \le 2n - \log_2 n - 2$$
$$D_{\Omega}\left(\mathcal{P}^{(n)}_{\odot}\right) \le 2 \log_2 n$$

Proof The solution to the recurrence on $C(k)$ is $C(k) = 2^{k+1} - k - 2$, as the reader can 
easily show. It satisfies the base case of k = 0 and the general case as well. The solution to 
$D(k)$ is $D(k) = 2k$. ■ 

When n is not a power of 2, we can start with a circuit for the next higher power of 2 and 
then delete operations and edges that are not used to produce the first n outputs. 

Figure 2.13 A simple recursive construction of a prefix circuit when $n = 2^3 = 8$. The gates 
used at each stage of the construction are grouped into individual shaded regions. 
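
The recursive construction translates almost line for line into code. The Python sketch below is illustrative: `parallel_prefix` is an invented name, lists are 0-indexed (so `x[0]` holds $x_1$), the input length is assumed to be a power of 2, and the function counts operator instances rather than building a circuit.

```python
import operator


def parallel_prefix(x, op):
    """Return (y, count): the prefix values and the number of op instances used."""
    n = len(x)
    if n == 1:
        return list(x), 0                              # base case: y_1 = x_1
    # First stage: x[1,2], x[3,4], ..., one operator per pair.
    pairs = [op(x[i], x[i + 1]) for i in range(0, n, 2)]
    # Recursive prefix on n/2 values yields x[1,2], x[1,4], ..., x[1,n].
    evens, count = parallel_prefix(pairs, op)
    y = [None] * n
    y[0] = x[0]
    for i in range(n // 2):
        y[2 * i + 1] = evens[i]
    # Last stage: x[1,2i+1] = x[1,2i] op x_{2i+1} for the remaining odd positions.
    for i in range(1, n // 2):
        y[2 * i] = op(evens[i - 1], x[2 * i])
    return y, count + n // 2 + (n // 2 - 1)


y, count = parallel_prefix([2, 1, 3, 7, 5, 1, 4, 6], operator.add)
print(y, count)
```

For n = 8 the count is 11, which agrees with $2n - \log_2 n - 2$. Running the same function with string concatenation confirms that operands are never reordered, so the construction does not rely on commutativity.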



2.7 Addition 

Addition is a central operation in all general-purpose digital computers. In this section we 
describe the standard ripple adder and the fast carry-lookahead addition circuits. The ripple 
adder mimics the elementary method of addition taught to beginners but for binary instead of 
decimal numbers. Carry-lookahead addition is a fast addition method based on the fast prefix 
circuit described in the preceding section. 

Consider the binary representation of integers in the set $\{0, 1, 2, \ldots, 2^n - 1\}$. They are 
represented by binary n-tuples $u = (u_{n-1}, u_{n-2}, \ldots, u_1, u_0)$ and have value 

$$|u| = \sum_{j=0}^{n-1} u_j 2^j$$

where $\sum$ denotes integer addition. 

The addition function $f^{(n)}_{\text{add}} : \mathcal{B}^{2n} \mapsto \mathcal{B}^{n+1}$ computes the sum of two binary n-bit 
numbers $u$ and $v$, as shown below, where + denotes integer addition: 

$$|u| + |v| = \sum_{j=0}^{n-1} (u_j + v_j) 2^j$$

The tuple $((u_{n-1} + v_{n-1}), (u_{n-2} + v_{n-2}), \ldots, (u_0 + v_0))$ is not a binary number because the 
coefficients of the powers of 2 are not Boolean. However, if the integer $u_0 + v_0$ is converted to 
a binary number $(c_1, s_0)$, where $c_1 2^1 + s_0 2^0 = u_0 + v_0$, then the sum can be replaced by 

$$|u| + |v| = \sum_{j=2}^{n-1} (u_j + v_j) 2^j + (u_1 + v_1 + c_1) 2^1 + s_0 2^0$$

where the least significant bit is now Boolean. In turn, the sum $u_1 + v_1 + c_1$ can be represented 
in binary by $(c_2, s_1)$, where $c_2 2 + s_1 = u_1 + v_1 + c_1$. The sum $|u| + |v|$ can then be replaced 
by one in which the two least significant coefficients are Boolean. Repeating this process on all 
coefficients, we have the ripple adder shown in Fig. 2.14. 

In the general case, the jth stage of a ripple adder combines the jth coefficients of each 
binary number, namely $u_j$ and $v_j$, and the carry from the previous stage, $c_j$, and represents 
their integer sum with the binary notation $(c_{j+1}, s_j)$, where 

$$c_{j+1} 2 + s_j = u_j + v_j + c_j$$

Here $c_{j+1}$, the number of 2's in the sum $u_j + v_j + c_j$, is the carry into the (j + 1)st stage 
and $s_j$, the number of 1's in the sum modulo 2, is the external output from the jth stage. 
The circuit performing this mapping is called a full adder (see Fig. 2.15). As the reader can 
easily show by constructing a table, this circuit computes the function $f_{\text{FA}} : \mathcal{B}^3 \mapsto \mathcal{B}^2$, where 
$f_{\text{FA}}(u_j, v_j, c_j) = (c_{j+1}, s_j)$ is described by the following formulas: 

$$\begin{aligned}
p_j &= u_j \oplus v_j \\
g_j &= u_j \wedge v_j \\
c_{j+1} &= (p_j \wedge c_j) \vee g_j \\
s_j &= p_j \oplus c_j
\end{aligned} \qquad (2.6)$$

Here $p_j$ and $g_j$ are intermediate variables with a special significance. If $g_j = 1$, a carry is 
generated at the jth stage. If $p_j = 1$, a carry from the previous stage is propagated through 
the jth stage, that is, a carry-out occurs exactly when a carry-in occurs. Note that $p_j$ and $g_j$ 
cannot both have value 1. 

The full adder can be realized with five gates and depth 3. Since the first full adder has 
value 0 for its carry input, three gates can be eliminated from its circuit and its depth reduced 
by 2. It follows that a ripple adder can be realized by a circuit with the following size and 
depth. 

Figure 2.14 A ripple adder for binary numbers. 

Figure 2.15 A full adder realized with gates. 

THEOREM 2.7.1 The addition function $f^{(n)}_{\text{add}} : \mathcal{B}^{2n} \mapsto \mathcal{B}^{n+1}$ can be realized with a ripple adder 
with the following size and depth bounds over the basis $\Omega = \{\wedge, \vee, \oplus\}$: 

$$C_{\Omega}\left(f^{(n)}_{\text{add}}\right) \le 5n - 3$$

$$D_{\Omega}\left(f^{(n)}_{\text{add}}\right) \le 3n - 2$$

(Do the ripple adders actually have depth less than 3n - 2?) 
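
The formulas in (2.6) can be checked exhaustively for small n. The following Python sketch is illustrative: `full_adder` and `ripple_add` are invented names, and entry `u[j]` holds the coefficient $u_j$ of $2^j$.

```python
def full_adder(u_j, v_j, c_j):
    """Return (c_{j+1}, s_j) computed with the five gates of (2.6)."""
    p = u_j ^ v_j                  # propagate
    g = u_j & v_j                  # generate
    return (p & c_j) | g, p ^ c_j


def ripple_add(u, v):
    """Add two n-bit numbers; returns n + 1 bits with the carry-out last."""
    carry = 0
    total = []
    for u_j, v_j in zip(u, v):
        carry, s_j = full_adder(u_j, v_j, carry)
        total.append(s_j)
    return total + [carry]


def bits(value, n):
    return [(value >> j) & 1 for j in range(n)]


def value(bit_list):
    return sum(b << j for j, b in enumerate(bit_list))


n = 4
for a in range(2 ** n):
    for b in range(2 ** n):
        assert value(ripple_add(bits(a, n), bits(b, n))) == a + b
```

The carry variable threads through every stage in turn, which is the source of the linear depth.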

2.7. 1 Carry-Lookahead Addition 

The ripple adder is economical; it uses a small number of gates. Unfortunately, it is slow. The 
depth of the circuit, a measure of its speed, is linear in n, the number of bits in each integer. 
The carry-lookahead adder described below is considerably faster. It uses the parallel prefix 
circuit described in the preceding section. 

The carry-lookahead adder circuit is obtained by applying the prefix operation to pairs 
in $\mathcal{B}^2$ using the associative operator $\circ : (\mathcal{B}^2)^2 \mapsto \mathcal{B}^2$ defined below. Let $(a, b)$ and $(c, d)$ be 
arbitrary pairs in $\mathcal{B}^2$. Then $\circ$ is defined by the following formula: 

$$(a, b) \circ (c, d) = (a \wedge c, (b \wedge c) \vee d)$$

To show that $\circ$ is associative, it suffices to show by straightforward algebraic manipulation that 
for all values of $a$, $b$, $c$, $d$, $e$, and $f$ the following holds: 

$$((a, b) \circ (c, d)) \circ (e, f) = (a, b) \circ ((c, d) \circ (e, f)) = (a\,c\,e,\; b\,c\,e \vee d\,e \vee f)$$

Let $\pi[j, j] = (p_j, g_j)$ and, for $j < k$, let $\pi[j, k] = \pi[j, k - 1] \circ \pi[k, k]$. By induction it is 
straightforward to show that the first component of $\pi[j, k]$ is 1 if and only if a carry propagates 
through the full adder stages numbered $j, j + 1, \ldots, k$ and its second component is 1 if and 
only if a carry is generated at the rth stage, $j \le r \le k$, and propagates from that stage through 
the kth stage. (See Problem 2.26.) 

The prefix computation on the string $(\pi[0, 0], \pi[1, 1], \ldots, \pi[n - 1, n - 1])$ with the op- 
erator $\circ$ produces the string $(\pi[0, 0], \pi[0, 1], \pi[0, 2], \ldots, \pi[0, n - 1])$. The first component of 
$\pi[0, j]$ is 1 if and only if a carry generated at the zeroth stage, $c_0$, is propagated through the 
jth stage. Since $c_0 = 0$, this component is not used. The second component of $\pi[0, j]$, $c_{j+1}$, 
is 1 if and only if a carry is generated at or before the jth stage. From (2.6) we see that the 
sum bit generated at the jth stage, $s_j$, satisfies $s_j = p_j \oplus c_j$. Thus the jth output bit, $s_j$, is 
obtained from the EXCLUSIVE OR of $p_j$ and the second component of $\pi[0, j - 1]$. 

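THEOREM 2.7.2 For $n = 2^k$, k an integer, the addition function $f^{(n)}_{\text{add}} : \mathcal{B}^{2n} \mapsto \mathcal{B}^{n+1}$ can 
be realized with a carry-lookahead adder with the following size and depth bounds over the basis 
$\Omega = \{\wedge, \vee, \oplus\}$: 

$$C_{\Omega}\left(f^{(n)}_{\text{add}}\right) \le 8n$$

$$D_{\Omega}\left(f^{(n)}_{\text{add}}\right) \le 4 \log_2 n + 2$$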



Proof The prefix circuit uses $2n - \log_2 n - 2$ instances of $\circ$ and has depth $2 \log_2 n$. Since 
each instance of $\circ$ can be realized by a circuit of size 3 and depth 2, the size bound is 
multiplied by 3 and the depth bound by 2. Since the first component of $\pi[0, j]$ is not used, the propagate 
value computed at each output combiner vertex can be eliminated. This saves one gate per 
result bit, or n gates. However, for each $0 \le j \le n - 1$ we need two gates to compute $p_j$ 
and $g_j$ and one gate to compute $s_j$, 3n additional gates. The computation of these three 
sets of functions adds depth 2 to that of the prefix circuit. This gives the desired bounds. ■ 

The addition function $f^{(n)}_{\text{add}}$ is computed by the carry-lookahead adder circuit with 1.6 
times as many gates as the ripple adder but in logarithmic instead of linear depth. 
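
The role of the operator on propagate-generate pairs can be seen in a small simulation. This Python sketch is illustrative only: `carry_op` and `cla_add` are invented names, entry `u[j]` holds $u_j$, and the prefix over pairs is evaluated sequentially with `itertools.accumulate` for brevity. Because the operator is associative, the same values would come out of a logarithmic-depth prefix circuit.

```python
from itertools import accumulate


def carry_op(left, right):
    """(a, b) o (c, d) = (a AND c, (b AND c) OR d)."""
    (a, b), (c, d) = left, right
    return (a & c, (b & c) | d)


def cla_add(u, v):
    """Add two n-bit numbers using a prefix computation on (p_j, g_j) pairs."""
    n = len(u)
    p = [a ^ b for a, b in zip(u, v)]
    g = [a & b for a, b in zip(u, v)]
    prefixes = list(accumulate(zip(p, g), carry_op))   # prefixes[j] = pi[0, j]
    carries = [0] + [second for _, second in prefixes]  # carries[j] = c_j
    total = [p[j] ^ carries[j] for j in range(n)]       # s_j = p_j XOR c_j
    return total + [carries[n]]


def bits(value, n):
    return [(value >> j) & 1 for j in range(n)]


n = 4
for a in range(2 ** n):
    for b in range(2 ** n):
        result = cla_add(bits(a, n), bits(b, n))
        assert sum(bit << j for j, bit in enumerate(result)) == a + b
```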

When exact addition is expected and every number is represented by n bits, a carry-out of 
the last stage of an adder constitutes an overflow, an error. 



2.8 Subtraction 

Subtraction is possible when negative numbers are available. There are several ways to repre- 
sent negative numbers. To demonstrate that subtraction is not much harder than addition, we 
consider the signed two's complement representation for positive and negative integers in the 
set $\mathbb{Z}(n) = \{-2^n, \ldots, -2, -1, 0, 1, 2, \ldots, 2^n - 1\}$. Each signed number $u$ is represented by 
an (n + 1)-tuple $(\sigma, \mathbf{u})$, where $\sigma$ is its sign and $\mathbf{u} = (u_{n-1}, \ldots, u_0)$ is a binary number that 
is either the magnitude $|u|$ of the number $u$, if positive, or the two's complement $2^n - |u|$ of 
it, if negative. The sign $\sigma$ is defined below: 

$$\sigma = \begin{cases} 0 & \text{the number } u \text{ is positive or zero} \\ 1 & \text{the number } u \text{ is negative} \end{cases}$$

The two's complement of an n-bit binary number $\mathbf{v}$ is easily formed by adding 1 to $t = 
2^n - 1 - |\mathbf{v}|$. Since $2^n - 1$ is represented as the n-tuple of 1's, $t$ is obtained by complementing 
(NOTing) every bit of $\mathbf{v}$. Thus, the two's complement of $\mathbf{u}$ is obtained by complementing every 
bit of $\mathbf{u}$ and then adding 1. It follows that the two's complement of the two's complement of 
a number is the number itself. Thus, the magnitude of a negative number $(1, \mathbf{u})$ is the two's 
complement of $\mathbf{u}$. 

This is illustrated by the integers in the set $\mathbb{Z}(4) = \{-16, \ldots, -2, -1, 0, 1, 2, \ldots, 15\}$. 
The two's complement representations of the decimal integers 9 and -11 are 

9 = (0, 1, 0, 0, 1) 
-11 = (1, 0, 1, 0, 1) 

Note that the two's complement of 11 is 16 - 11 = 5, which is represented by the four-tuple 
(0, 1, 0, 1). The value of the two's complement of 11 can be computed by complementing all 
bits in its binary representation (1, 0, 1, 1) and adding 1. 

To add two numbers $u$ and $v$ in two's complement notation $(\sigma_u, \mathbf{u})$ 
and $(\sigma_v, \mathbf{v})$, we add them as binary (n + 1)-tuples and discard the overflow bit, that is, the 
coefficient of $2^{n+1}$. We now show that this procedure provides a correct answer when no 
overflow occurs and establish the conditions under which overflow does occur. 

Let $|u|$ and $|v|$ denote the magnitudes of the two numbers. There are four cases for their 
sum $u + v$: 

$$\begin{array}{llll}
\text{Case} & u & v & u + v \\
\text{I} & \ge 0 & \ge 0 & |u| + |v| \\
\text{II} & \ge 0 & < 0 & 2^{n+1} + |u| - |v| \\
\text{III} & < 0 & \ge 0 & 2^{n+1} - |u| + |v| \\
\text{IV} & < 0 & < 0 & 2^{n+1} + 2^{n+1} - |u| - |v|
\end{array}$$

In the first case the sum is positive. If the coefficient of $2^n$ is 1, an overflow error is detected. 
In the second case, if $|u| - |v|$ is negative, then $2^{n+1} + |u| - |v| = 2^n + 2^n - \big||u| - |v|\big|$ and 
the result is in two's complement notation with sign 1, as it should be. If $|u| - |v|$ is positive, 
the coefficient of $2^n$ is 0 (a carry-out of the last stage has occurred) and the result is a positive 
number with sign bit 0, properly represented. A similar statement applies to the third case. 
In the fourth case, if $|u| + |v|$ is less than $2^n$, the sum is $2^{n+1} + 2^n + (2^n - (|u| + |v|))$, 
which is $2^n + (2^n - (|u| + |v|))$ when the coefficient of $2^{n+1}$ is discarded. This is a proper 
representation for a negative number. However, if $|u| + |v| > 2^n$, a borrow occurs from the 
(n + 1)st position and the sum $2^{n+1} + 2^n + (2^n - (|u| + |v|))$ has a 0 in the (n + 1)st 
position, which is not a proper representation for a negative number (after discarding $2^{n+1}$); 
overflow has occurred. 

The following procedure can be used to subtract integer u from integer v: form the two's 
complement of u and add it to the representation for v. The negation of a number is obtained 
by complementing its sign and taking the two's complement of its binary n-tuple. It follows 
that subtraction can be done with a circuit of size linear in n and depth logarithmic in n. (See 
Problem 2.27.) 
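
The addition rule and the overflow conditions can be checked exhaustively for n = 4. In this Python sketch an (n + 1)-tuple $(\sigma, \mathbf{u})$ is stored as an integer code in the range 0 to $2^{n+1} - 1$; the function names are invented, and `negate` takes the two's complement of the whole (n + 1)-bit word, which is an implementation choice made for the example rather than the sign-and-tuple procedure described in the text.

```python
def to_twos(value, n):
    """Encode an integer in [-2^n, 2^n - 1] as an (n + 1)-bit two's complement code."""
    assert -(2 ** n) <= value < 2 ** n
    return value % 2 ** (n + 1)


def from_twos(code, n):
    """Decode: a sign bit of 1 means the code stands for code - 2^(n+1)."""
    return code - 2 ** (n + 1) if code >> n else code


def add_twos(a, b, n):
    """Add two codes, discard the coefficient of 2^(n+1), and report overflow."""
    total = (a + b) % 2 ** (n + 1)
    sign_a, sign_b, sign_t = a >> n, b >> n, total >> n
    overflow = sign_a == sign_b and sign_t != sign_a
    return total, overflow


def negate(code, n):
    """Complement every bit of the (n + 1)-bit word and add 1."""
    mask = 2 ** (n + 1) - 1
    return ((code ^ mask) + 1) & mask


n = 4
for u in range(-16, 16):
    for v in range(-16, 16):
        total, overflow = add_twos(to_twos(u, n), to_twos(v, n), n)
        if -16 <= u + v <= 15:
            assert not overflow and from_twos(total, n) == u + v
        else:
            assert overflow
```

Subtracting $u$ from $v$ then amounts to `add_twos(to_twos(v, n), negate(to_twos(u, n), n), n)`, which reuses the adder unchanged.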



2.9 Multiplication 



In this section we examine several methods of multiplying integers. We begin with the stan- 
dard elementary integer multiplication method based on the binary representation of numbers. 
This method requires $O(n^2)$ gates and has depth $O(\log^2 n)$ on n-bit numbers. We then ex- 
amine a divide-and-conquer method that has the same depth but much smaller circuit size. 
We also describe fast multiplication methods, that is, methods that have circuits with smaller 
depths. These include a circuit whose depth is much smaller than $O(\log n)$. It uses a novel 
representation of numbers, namely, the exponents of numbers in their prime number decom- 
position. 

The integer multiplication function $f^{(n)}_{\text{mult}} : \mathcal{B}^{2n} \mapsto \mathcal{B}^{2n}$ can be realized by the standard 
integer multiplication algorithm, which is based on the following representation for the 
product of integers represented as binary n-tuples $u$ and $v$: 

$$|u||v| = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} u_i v_j 2^{i+j} \qquad (2.7)$$

Here $|u|$ and $|v|$ are the magnitudes of the integers represented by $u$ and $v$. The standard 
algorithm forms the products $u_i v_j$ individually to create n binary numbers, as suggested below. 
Here each row corresponds to a different number; the columns correspond to powers of 2 with 
the rightmost column corresponding to the least significant component, namely the coefficient 
of $2^0$. 

$$\begin{array}{ccccccccc}
 & & & u_0 v_3 & u_0 v_2 & u_0 v_1 & u_0 v_0 & = & z_0 \\
 & & u_1 v_3 & u_1 v_2 & u_1 v_1 & u_1 v_0 & & = & z_1 \\
 & u_2 v_3 & u_2 v_2 & u_2 v_1 & u_2 v_0 & & & = & z_2 \\
u_3 v_3 & u_3 v_2 & u_3 v_1 & u_3 v_0 & & & & = & z_3
\end{array} \qquad (2.8)$$



Let the $i$th binary number produced by this multiplication operation be $z_i$. Since each of
these $n$ binary numbers contains at most $2n - 1$ bits, we treat them as if they were $(2n-1)$-bit
numbers. If these numbers are added in the order shown in Fig. 2.16(a) using a carry-lookahead
adder at each step, the time to perform the additions, measured by the depth of a
circuit, is $O(n \log n)$. The size of this circuit is $O(n^2)$. A faster circuit containing about the
same number of gates can be constructed by adding $z_0, \ldots, z_{n-1}$ in a balanced binary tree
with $n$ leaves, as shown in Fig. 2.16(b). This tree has $n - 1$ $(2n-1)$-bit adders. (A binary
tree with $n$ leaves has $n - 1$ internal vertices.) If each of the adders is a carry-lookahead adder,
the depth of this circuit is $O(\log^2 n)$ because the tree has $O(\log n)$ adders on every path from
the root to a leaf.
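
The two aggregation orders can be imitated with ordinary integer arithmetic. The sketch below is only an illustration under stated assumptions: each row $z_i$ of (2.8) is held as a Python integer, Python's `+` stands in for a carry-lookahead adder, and the names `partial_products` and `tree_sum` are hypothetical.

```python
def partial_products(u, v, n):
    """Rows z_0, ..., z_{n-1} of (2.8): z_i = u_i * |v| * 2^i."""
    return [((u >> i) & 1) * (v << i) for i in range(n)]


def tree_sum(values):
    """Add the values pairwise in a balanced tree.

    Returns (sum, levels), where levels is the number of adders on the
    longest path from a leaf to the root.
    """
    levels = 0
    while len(values) > 1:
        paired = [values[k] + values[k + 1] for k in range(0, len(values) - 1, 2)]
        if len(values) % 2:          # an unpaired value moves up unchanged
            paired.append(values[-1])
        values = paired
        levels += 1
    return values[0], levels
```

With n = 4 the tree has two levels of adders, whereas adding the rows one after another places three adders in sequence.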
Figure 2.16 Two methods for aggregating the binary numbers $z_0, \ldots, z_{n-1}$.

2.9.1 Carry-Save Multiplication

We now describe a much faster circuit obtained through the use of the carry-save adder. Let
$u$, $v$, and $w$ be three binary $n$-bit numbers. Their sum is a binary number $t$. It follows that
$|t|$ can be represented as

$$|t| = |u| + |v| + |w| = \sum_{i=0}^{n-1} (u_i + v_i + w_i) 2^i$$

With a full adder the sum $(u_i + v_i + w_i)$ can be converted to the binary representation
$2c_{i+1} + s_i$. Making this substitution, we have the following expression for the sum:

$$|u| + |v| + |w| = \sum_{i=0}^{n-1} (2c_{i+1} + s_i) 2^i = |c| + |s|$$

Here $c$ with $c_0 = 0$ is an $(n+1)$-tuple and $s$ is an $n$-tuple. The conversion of $(u_i, v_i, w_i)$ to
$(c_{i+1}, s_i)$ can be done with the full adder circuit shown in Fig. 2.15 of size 5 and depth 3 over
the basis $\Omega = \{\wedge, \vee, \oplus\}$.

The function $f_{\text{carry-save}}^{(n)} : \mathcal{B}^{3n} \mapsto \mathcal{B}^{2n+2}$ that maps three binary $n$-tuples, $u$, $v$, and $w$,
to the pair $(c, s)$ described above is the carry-save adder. A circuit of full adders that realizes
this function is shown in Fig. 2.17.

THEOREM 2.9.1 The carry-save adder function $f_{\text{carry-save}}^{(n)} : \mathcal{B}^{3n} \mapsto \mathcal{B}^{2n+2}$ can be realized with
the following size and depth over the basis $\Omega = \{\wedge, \vee, \oplus\}$:

$$C_\Omega\left(f_{\text{carry-save}}^{(n)}\right) \leq 5n$$
$$D_\Omega\left(f_{\text{carry-save}}^{(n)}\right) \leq 3$$

Three binary n-bit numbers u, v, w can be added by combining them in a carry-save 
adder to produce the pair (c, s), which are then added in an (n + l)-input binary adder. Any 
adder can be used for this purpose. 
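
The carry-save step can be sketched bit by bit. The code below is an illustration, assuming the three operands are given as Python integers of at most n bits; the name `carry_save` is hypothetical, and each loop iteration plays the role of one full adder.

```python
def carry_save(u, v, w, n):
    """Reduce three n-bit numbers to a pair (c, s) with |c| + |s| = |u| + |v| + |w|.

    Bit i of s is the full-adder sum of (u_i, v_i, w_i); bit i+1 of c is its carry.
    No carry propagates between positions, so every position is independent.
    """
    c, s = 0, 0
    for i in range(n):
        a, b, d = (u >> i) & 1, (v >> i) & 1, (w >> i) & 1
        s |= (a ^ b ^ d) << i
        c |= ((a & b) | ((a ^ b) & d)) << (i + 1)
    return c, s
```

The pair returned still has to be added by a conventional adder to obtain the final sum.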
Figure 2.17 A carry-save adder realized by an array of full adders.

A multiplier for two $n$-bit binary numbers can be formed by first creating the $n$ $(2n-1)$-bit binary
numbers shown in (2.8) and then adding them, as explained above. These $n$ numbers can be
added in groups of three, as suggested in Fig. 2.18.

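Let's now count the number of levels of carry-save adders in this construction. At the
zeroth level there are $m_0 = n$ numbers. At the $j$th level there are

$$m_j = 2\lfloor m_{j-1}/3 \rfloor + m_{j-1} - 3\lfloor m_{j-1}/3 \rfloor = m_{j-1} - \lfloor m_{j-1}/3 \rfloor$$

binary numbers. This follows because there are $\lfloor m_{j-1}/3 \rfloor$ groups of three binary numbers and
each group is mapped to two binary numbers. Not combined into such groups are
$m_{j-1} - 3\lfloor m_{j-1}/3 \rfloor$ binary numbers, giving the total $m_j$. Since $(x-2)/3 \leq \lfloor x/3 \rfloor \leq x/3$, we have

$$\frac{2}{3} m_{j-1} \leq m_j \leq \frac{2}{3} m_{j-1} + \frac{2}{3}$$

from which it is easy to show by induction that the following inequality holds:

$$\left(\frac{2}{3}\right)^j n \leq m_j \leq \left(\frac{2}{3}\right)^j n + 2\left(1 - \left(\frac{2}{3}\right)^j\right)$$

Let $s$ be the number of stages after which $m_s = 2$. Since $m_{s-1} \geq 3$, we have

$$\frac{\log_2 (n/2)}{\log_2 (3/2)} \leq s \leq \frac{\log_2 n}{\log_2 (3/2)} + 1$$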



The number of carry-save adders used in this construction is n — 2. This follows from the 
observation that the number of carry-save adders used in one stage is equal to the decrease in 
the number of binary numbers from one stage to the next. Since we start with n and finish 
with 2, the result follows. 
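
The stage count and the adder count can be checked numerically. The following sketch simply iterates the recurrence $m_j = m_{j-1} - \lfloor m_{j-1}/3 \rfloor$ until two numbers remain; the function name is hypothetical.

```python
def carry_save_stages(n):
    """Return (s, adders): the number of carry-save stages and the number of
    carry-save adders needed to reduce n >= 2 binary numbers to two."""
    m, stages, adders = n, 0, 0
    while m > 2:
        groups = m // 3      # each group of three numbers uses one carry-save adder
        adders += groups
        m -= groups          # three numbers become two, a net decrease of one
        stages += 1
    return stages, adders
```

For nine numbers, as in Fig. 2.18, the counts at successive levels are 9, 6, 4, 3, 2.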

After reducing the $n$ binary numbers to two binary numbers through a series of carry-save
adder stages, the two remaining binary numbers are added in a traditional binary adder. Since
each carry-save adder operates on three $(2n-1)$-bit binary numbers, each uses at most $5(2n-1)$
gates and has depth 3. Summarizing, we have the following theorem showing that carry-save
addition provides a multiplication circuit of depth $O(\log n)$ but of size quadratic in $n$.

Figure 2.18 Schema for the carry-save combination of nine 18-bit numbers.

THEOREM 2.9.2 The binary multiplication function $f_{\text{mult}}^{(n)} : \mathcal{B}^{2n} \mapsto \mathcal{B}^{2n}$ for $n$-bit binary
numbers can be realized by carry-save addition by a circuit of the following size and depth over
the basis $\Omega = \{\wedge, \vee, \oplus\}$:

$$C_\Omega\left(f_{\text{mult}}^{(n)}\right) \leq 5(2n-1)(n-2) + C_\Omega\left(f_{\text{add}}^{(2n)}\right)$$
$$D_\Omega\left(f_{\text{mult}}^{(n)}\right) \leq 3s + D_\Omega\left(f_{\text{add}}^{(2n)}\right)$$

where $s$, the number of carry-save adder stages, satisfies

$$s \leq \frac{\log_2 n}{\log_2 (3/2)} + 1$$

It follows from this theorem and the results of Theorem 2.7.2 that two $n$-bit binary numbers
can be multiplied by a circuit of size $O(n^2)$ and depth $O(\log n)$.

2.9.2 Divide-and-Conquer Multiplication 

We now examine a multiplier of much smaller circuit size but depth $O(\log^2 n)$. It uses a
divide-and-conquer technique. We represent two positive integers by their $n$-bit binary numbers
$u$ and $v$. We assume that $n$ is even and decompose each number into two $(n/2)$-bit
numbers:

$$u = (u_h, u_l), \quad v = (v_h, v_l)$$

where $u_h$, $u_l$, $v_h$, $v_l$ are the high and low components of the vectors $u$ and $v$, respectively.
Then we can write

$$|u| = |u_h| 2^{n/2} + |u_l|$$
$$|v| = |v_h| 2^{n/2} + |v_l|$$

from which we have

$$|u||v| = |u_l||v_l| + \left(|u_h||v_h| + (|v_h| - |v_l|)(|u_l| - |u_h|) + |u_l||v_l|\right) 2^{n/2} + |u_h||v_h| 2^n$$

It follows from this expression that only three integer multiplications are needed, namely
$|u_l||v_l|$, $|u_h||v_h|$, and $(|v_h| - |v_l|)(|u_l| - |u_h|)$; multiplication by a power of 2 is done by
realigning bits for addition. Each multiplication is of $(n/2)$-bit numbers. Six additions and
subtractions of $2n$-bit numbers suffice to complete the computation. Each of the additions
and subtractions can be done with a linear number of gates in logarithmic time.
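
The three-multiplication identity translates directly into a recursive procedure. The sketch below is an illustration rather than a circuit description: it assumes n is a power of 2, uses Python integers for the operands, and handles the sign of the difference product separately so that every recursive call receives non-negative $(n/2)$-bit magnitudes. The name `dc_multiply` is hypothetical.

```python
def dc_multiply(u, v, n):
    """Multiply two n-bit non-negative integers (n a power of 2) using
    three recursive multiplications of (n/2)-bit numbers."""
    if n == 1:
        return u & v                       # one AND gate suffices
    half = n // 2
    mask = (1 << half) - 1
    u_h, u_l = u >> half, u & mask
    v_h, v_l = v >> half, v & mask
    low = dc_multiply(u_l, v_l, half)      # |u_l||v_l|
    high = dc_multiply(u_h, v_h, half)     # |u_h||v_h|
    a, b = v_h - v_l, u_l - u_h            # differences may be negative
    sign = -1 if (a < 0) != (b < 0) else 1
    mid = sign * dc_multiply(abs(a), abs(b), half)
    return low + ((high + mid + low) << half) + (high << n)
```

Multiplication by $2^{n/2}$ and $2^n$ appears as the shifts in the last line, mirroring the realignment of bits described above.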

If $C(n)$ and $D(n)$ are the size and depth of a circuit for integer multiplication realized
with this divide-and-conquer method, then we have

$$C(n) \leq 3C(n/2) + cn \tag{2.9}$$

$$D(n) \leq D(n/2) + d \log_2 n \tag{2.10}$$

where $c$ and $d$ are constants of the construction. Since $C(1) = 1$ and $D(1) = 1$ (one use
of AND suffices), we have the following theorem, the proof of which is left as an exercise (see
Problem 2.28).

THEOREM 2.9.3 If $n = 2^k$, the binary multiplication function $f_{\text{mult}}^{(n)} : \mathcal{B}^{2n} \mapsto \mathcal{B}^{2n}$ for $n$-bit
binary numbers can be realized by a circuit for the divide-and-conquer algorithm of the following
size and depth over the basis $\Omega = \{\wedge, \vee, \oplus\}$:

$$C_\Omega\left(f_{\text{mult}}^{(n)}\right) = O\left(3^{\log_2 n}\right) = O\left(n^{\log_2 3}\right)$$
$$D_\Omega\left(f_{\text{mult}}^{(n)}\right) = O(\log^2 n)$$
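
The recurrences (2.9) and (2.10) can be unrolled directly. The following is a sketch of one way to do so for $n = 2^k$, using $C(1) = D(1) = 1$; it is not necessarily the argument intended in Problem 2.28.

```latex
\begin{align*}
C(2^k) &\le 3^k\,C(1) + c\sum_{i=0}^{k-1} 3^i\,2^{k-i}
        = 3^k + 2c\,(3^k - 2^k) = O(3^k) = O\!\left(n^{\log_2 3}\right),\\
D(2^k) &\le D(1) + d\sum_{i=1}^{k} \log_2 2^i
        = 1 + d\,\frac{k(k+1)}{2} = O(\log^2 n).
\end{align*}
```

The geometric sum in the first line uses $\sum_{i=0}^{k-1} (3/2)^i = 2\left((3/2)^k - 1\right)$.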

The size of this divide-and-conquer multiplication circuit is $O(n^{1.585})$, which is much
smaller than the $O(n^2)$ bound based on carry-save addition. The depth bound can be reduced
to $O(\log n)$ through the use of carry-save addition. (See Problem 2.29.) However, even faster
multiplication algorithms are known for large $n$.

2.9.3 Fast Multiplication 

Schönhage and Strassen [303] have described a circuit to multiply integers represented in
binary that is asymptotically small and shallow. Their algorithm for the multiplication of $n$-bit
binary numbers uses $O(n \log n \log\log n)$ gates and depth $O(\log n)$. It illustrates the point
that a circuit can be devised for this problem that has depth $O(\log n)$ and uses a number of
gates considerably less than quadratic in $n$. Although the coefficients on the size and depth
bounds are so large that their circuit is not practical, their result is interesting and motivates
the following definition.

DEFINITION 2.9.1 $M_{\text{int}}(n, c)$ is the size of the smallest circuit for the multiplication of two $n$-bit
binary numbers that has depth at most $c \log_2 n$ for $c > 0$.

The Schönhage-Strassen circuit demonstrates that $M_{\text{int}}(n, c) = O(n \log n \log\log n)$ for
all $n \geq 1$. It is also clear that $M_{\text{int}}(n, c) = \Omega(n)$ because any multiplication circuit must
examine each component of each binary number and no more than a constant number of
inputs can be combined by one gate. (Chapter 9 provides methods for deriving lower bounds
on the size and depth of circuits.)

Because we use integer multiplication in other circuits, it is convenient to make the following
reasonable assumption about the dependence of $M_{\text{int}}(n, c)$ on $n$. We assume that

$$M_{\text{int}}(dn, c) \leq d\,M_{\text{int}}(n, c)$$

for all $d$ satisfying $0 \leq d \leq 1$. This condition is satisfied by the Schönhage-Strassen circuit.

2.9.4 Very Fast Multiplication 

If integers in the set $\{0, 1, \ldots, N - 1\}$ are represented by the exponents of primes in their
prime factorization, they can be multiplied by adding exponents. The largest exponent on a
prime in this range is at most $\log_2 N$. Thus, exponents can be represented by $O(\log\log N)$
bits and integers multiplied by circuits with depth $O(\log\log\log N)$. (See Problem 2.32.)
This depth is much smaller than $\Theta(\log\log N)$, the depth of circuits to add integers in any
fixed radix system. (Note that if $N = 2^n$, $\log_2 \log_2 N = \log_2 n$.) However, addition is very
difficult in this number system. Thus, it is a fast number system only if the operations are
limited to multiplications.
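
A small illustration of this number system follows. It is a sketch under stated assumptions: only positive integers whose prime factors lie in a fixed, illustrative list `PRIMES` are handled, and the function names are hypothetical.

```python
PRIMES = (2, 3, 5, 7, 11, 13)   # illustrative choice; any fixed list of primes works


def to_exponents(x):
    """Represent a positive integer by the exponents of PRIMES in its factorization."""
    if x < 1:
        raise ValueError("only positive integers are representable")
    exponents = []
    for p in PRIMES:
        e = 0
        while x % p == 0:
            x //= p
            e += 1
        exponents.append(e)
    if x != 1:
        raise ValueError("a prime factor lies outside PRIMES")
    return exponents


def multiply_exponents(a, b):
    """Multiplication is componentwise addition of exponents."""
    return [x + y for x, y in zip(a, b)]


def from_exponents(exponents):
    value = 1
    for p, e in zip(PRIMES, exponents):
        value *= p ** e
    return value
```

Each exponent is added independently of the others, which is why the depth depends only on the length of a single exponent.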
2.9.5 Reductions to Multiplication

The logical shifting function $f_{\text{shift}}^{(n)}$ can be reduced to the integer multiplication function $f_{\text{mult}}^{(n)}$, as
can be seen by letting one of the two $n$-tuple arguments be a power of 2. That is,

$$f_{\text{shift}}^{(n)}(x, s) = \pi_L^{(n)}\left(f_{\text{mult}}^{(n)}(x, y)\right)$$

where $y = f_{\text{decode}}^{(m)}(s)$ is the value of the decoder function (see Section 2.5) that maps a binary
$m$-tuple, $m = \lceil \log_2 n \rceil$, into a binary $2^m$-tuple containing a single 1 at the output indexed
by the integer represented by $s$ and $\pi_L^{(n)}$ is the projection operator defined on page 50.

LEMMA 2.9.1 The logical shifting function $f_{\text{shift}}^{(n)}$ can be reduced to the binary integer multiplication
function $f_{\text{mult}}^{(n)}$ through the application of the decoder function $f_{\text{decode}}^{(m)}$ on $m = \lceil \log_2 n \rceil$
inputs.

As shown in Section 2.5, the decoder function $f_{\text{decode}}^{(m)}$ can be realized with a circuit of size
very close to $2^m$ and depth $\lceil \log_2 m \rceil$. Thus, the shifting function has circuit size and depth
no more than constant factors larger than those for integer multiplication.
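
The reduction can be mimicked with integers. The sketch below makes two assumptions that are not spelled out in this section: the decoder output is modeled as the integer $2^{|s|}$, and the projection is taken to keep the $n$ low-order bits of the product. The function name is hypothetical.

```python
def shift_by_multiplication(x, s, n):
    """Shift the n-bit value x left by s places using one multiplication."""
    y = 1 << s                        # decoder output: a single 1 in position s
    product = x * y                   # integer multiplication
    return product & ((1 << n) - 1)   # assumed projection onto the low-order n bits
```

Under these assumptions, bits shifted past position $n - 1$ are discarded and zeros enter at the low end.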

The squaring function $f_{\text{square}}^{(n)} : \mathcal{B}^n \mapsto \mathcal{B}^{2n}$ maps the binary $n$-tuple $x$ into the binary
$2n$-tuple $y$ representing the product of $x$ with itself. Since the squaring and integer multiplication
functions contain each other as subfunctions, as shown below, circuits for one can be
used for the other.

LEMMA 2.9.2 The integer multiplication function $f_{\text{mult}}^{(n)}$ contains the squaring function $f_{\text{square}}^{(n)}$
as a subfunction and $f_{\text{square}}^{(3n+1)}$ contains $f_{\text{mult}}^{(n)}$ as a subfunction.

Proof The first statement follows by setting the two $n$-tuple inputs of $f_{\text{mult}}^{(n)}$ to be the input
to $f_{\text{square}}^{(n)}$. The second statement follows by examining the value of $f_{\text{square}}^{(3n+1)}$ on the $(3n+1)$-tuple
input $(xzy)$, where $x$ and $y$ are binary $n$-tuples and $z$ is the zero binary $(n+1)$-tuple.
Thus, $(xzy)$ denotes the value $a = 2^{2n+1}|x| + |y|$ whose square $b$ is

$$b = 2^{4n+2}|x|^2 + 2^{2n+2}|x||y| + |y|^2$$

The value of the product $|x||y|$ can be read from the output because there is no carry
into $2^{2n+2}|x||y|$ from $|y|^2$, nor is there a carry into $2^{4n+2}|x|^2$ from $2^{2n+2}|x||y|$, since
$|x|, |y| \leq 2^n - 1$. ∎
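
The second half of the proof can be checked numerically. The sketch below builds the value $a = 2^{2n+1}|x| + |y|$, squares it, and reads the product from bit positions $2n+2$ through $4n+1$; the function name is hypothetical.

```python
def multiply_via_square(x, y, n):
    """Recover x*y for n-bit x and y from a single squaring of a (3n+1)-bit value."""
    a = (x << (2 * n + 1)) | y                  # the tuple (x z y), z = n+1 zeros
    b = a * a                                   # 2^(4n+2) x^2 + 2^(2n+2) x y + y^2
    return (b >> (2 * n + 2)) & ((1 << (2 * n)) - 1)
```

The mask keeps exactly $2n$ bits, which is enough because $|x||y| < 2^{2n}$.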



2.10 Reciprocal and Division 



In this section we examine methods to divide integers represented in binary. Since the division 
of one integer by another generally cannot be represented with a finite number of bits (consider, 
for example, the value of 2/3), we must be prepared to truncate the result of a division. The 
division method presented in this section is based on Newton's method for finding a zero of a 
function. 

Let $\mathbf{u} = (u_{n-1}, \ldots, u_1, u_0)$ and $\mathbf{v} = (v_{n-1}, \ldots, v_1, v_0)$ denote integers whose magnitudes
are $u$ and $v$. Then the division of one integer $u$ by another $v$, $u/v$, can be obtained as the
result of taking the product of $u$ with the reciprocal $1/v$. (See Problem 2.33.) For this reason,
we examine only the computation of reciprocals of $n$-bit binary numbers. For simplicity we
assume that $n$ is a power of 2.

The reciprocal of the $n$-bit binary number $\mathbf{u} = (u_{n-1}, \ldots, u_1, u_0)$ representing the integer
$u$ is a fractional number $r$ represented by the (possibly infinite) binary number
$\mathbf{r} = (r_{-1}, r_{-2}, r_{-3}, \ldots)$, where

$$r = r_{-1} 2^{-1} + r_{-2} 2^{-2} + r_{-3} 2^{-3} + \cdots$$

Some numbers, such as 3, have a binary reciprocal that has an infinite number of digits, such as
$(0, 1, 0, 1, 0, 1, \ldots)$, and cannot be expressed exactly as a binary tuple of finite extent. Others,
such as 4, have reciprocals that have finite extent, such as $(0, 1)$.

Our goal is to produce an $(n+2)$-bit approximation to the reciprocal of $n$-bit binary
numbers. (It simplifies the analysis to obtain an $(n+2)$-bit approximation instead of an $n$-bit
approximation.) We assume that each such binary number $\mathbf{u}$ has a 1 in its most significant
position; that is, $2^{n-1} \leq u < 2^n$. If this is not true, a simple circuit can be devised to determine
the number of places by which to shift $\mathbf{u}$ left to meet this condition. (See Problem 2.25.) The
result is shifted left by an equal amount to produce the reciprocal.

It follows that an $(n+2)$-bit approximation to the reciprocal of an $n$-bit binary number $\mathbf{u}$
with $u_{n-1} = 1$ is represented by $\mathbf{r} = (r_{-1}, r_{-2}, r_{-3}, \ldots)$, where the first $n - 2$ digits of $\mathbf{r}$ are
zero. Thus, the value of the approximate reciprocal is represented by the $n + 2$ components
$(r_{-(n-1)}, r_{-n}, \ldots, r_{-2n})$. It follows that these components are produced by shifting $\mathbf{r}$ left
by $2n$ places and removing the fractional bits. This defines the function $f_{\text{recip}}^{(n)}$:

$$f_{\text{recip}}^{(n)}(\mathbf{u}) = \left\lfloor \frac{2^{2n}}{u} \right\rfloor$$

The approximation described below can be used to compute reciprocals.

Newton's approximation algorithm is a method to find the zero $x_0$ of a twice continuously
differentiable function $h : \mathbb{R} \mapsto \mathbb{R}$ on the reals (that is, $h(x_0) = 0$) when $h$ has
a non-zero derivative $h'(x)$ in the neighborhood of $x_0$. As suggested in Fig. 2.19, the slope
of the tangent to the curve at the point $y_i$, $h'(y_i)$, is equal to $h(y_i)/(y_i - y_{i+1})$. For the
convex increasing function shown in this figure, the value of $y_{i+1}$ is closer to the zero $x_0$ than
is $y_i$. The same holds for all twice continuously differentiable functions whether increasing,
decreasing, convex, or concave in the neighborhood of a zero. It follows that the recurrence

$$y_{i+1} = y_i - \frac{h(y_i)}{h'(y_i)} \tag{2.11}$$

provides values increasingly close to the zero of $h$ as long as it is started with a value sufficiently
close to the zero.

Figure 2.19 Newton's method for finding the zero of a function.

The function $h(y) = 1 - 2^{2n}/(uy)$ has zero $y = 2^{2n}/u$. Since $h'(y) = 2^{2n}/(uy^2)$, the
recurrence (2.11) becomes

$$y_{i+1} = 2y_i - \frac{u y_i^2}{2^{2n}}$$

When this recurrence is modified as follows, it converges to the $(n+2)$-bit binary reciprocal
of the $n$-bit binary number $\mathbf{u}$:

$$y_{i+1} = \left\lfloor \frac{2^{2n+1} y_i - u y_i^2}{2^{2n}} \right\rfloor$$


The size and depth of a circuit resulting from this recurrence are $O(M_{\text{int}}(n, c) \log n)$ and
$O(\log^2 n)$, respectively. However, this recurrence uses more gates than are necessary since it
does calculations with full precision at each step even though the early steps use values of $y_i$
that are imprecise. We can reduce the size of the resulting circuit to $O(M_{\text{int}}(n, c))$ if, instead
of computing the reciprocal with $n + 2$ bits of accuracy at every step, we let the amount of
accuracy vary with the number of stages, as in the algorithm recip$(u, n)$ of Fig. 2.20. The
algorithm recip is called $1 + \log_2 n$ times, the last time when $n = 1$.

We now show that the algorithm recip$(u, n)$ computes the function $f_{\text{recip}}^{(n)}(\mathbf{u}) = r = \lfloor 2^{2n}/u \rfloor$.
In other words, we show that $r$ satisfies $ru = 2^{2n} - s$ for some $0 \leq s < u$. The
proof is by induction on $n$.

The inductive hypothesis is that the algorithm recip$(u, m)$ produces an $(m+2)$-bit
approximation to the reciprocal of the $m$-bit binary number $\mathbf{u}$ (whose most significant bit is
1), that is, it computes $r = \lfloor 2^{2m}/u \rfloor$. The assumption applies to the base case of $m = 1$ since
$u = 1$ and $r = 4$. We assume it holds for $m = n/2$ and show that it also holds for $m = n$.



Algorithm recip(u, n)
    if n = 1 then
        r := 4;
    else begin
        t := recip(⌊u/2^(n/2)⌋, n/2);
        r := ⌊(2^(3n/2+1) t − u t^2)/2^n⌋;
        for j := 3 downto 0 do
            if (u(r + 2^j) ≤ 2^(2n)) then r := r + 2^j;
    end;
    return (r);

Figure 2.20 An algorithm to compute $r$, the $(n+2)$-bit approximation to the reciprocal of the
$n$-bit binary number $\mathbf{u}$ representing the integer $u$, that is, $r = f_{\text{recip}}^{(n)}(\mathbf{u})$.
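
The algorithm of Fig. 2.20 can be transcribed almost line for line. The sketch below assumes that $n$ is a power of 2 and that $2^{n-1} \leq u < 2^n$, as in the text; shifts stand in for multiplication and division by powers of 2.

```python
def recip(u, n):
    """Return floor(2^(2n) / u) for an n-bit integer u with most significant bit 1."""
    if n == 1:
        return 4
    half = n // 2
    t = recip(u >> half, half)                    # reciprocal of the high-order half
    r = ((t << (3 * half + 1)) - u * t * t) >> n  # floor((2^(3n/2+1) t - u t^2) / 2^n)
    for j in (3, 2, 1, 0):                        # adjustment by at most 15
        if u * (r + (1 << j)) <= (1 << (2 * n)):
            r += 1 << j
    return r
```

For example, with $n = 4$ and $u = 11$ the recursive call returns $t = 8$, the second step gives $r = 20$, and the adjustment loop raises it to $23 = \lfloor 256/11 \rfloor$.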
Let $u_1$ and $u_0$ be the integers corresponding to the most and least significant $n/2$ bits
respectively of $u$, that is, $u = u_1 2^{n/2} + u_0$. Since $2^{n-1} \leq u < 2^n$, $2^{n/2-1} \leq u_1 < 2^{n/2}$.
Also, $\lfloor u/2^{n/2} \rfloor = u_1$. By the inductive hypothesis $t = \lfloor 2^n/u_1 \rfloor$ is the value returned by
recip$(u_1, n/2)$; that is, $u_1 t = 2^n - s'$ for some $0 \leq s' < u_1$. Let $w = 2^{3n/2+1} t - u t^2$.
Then

$$uw = 2^{2n+1} u_1 t + 2^{3n/2+1} u_0 t - \left[t\left(u_1 2^{n/2} + u_0\right)\right]^2$$

Applying $u_1 t = 2^n - s'$, dividing both sides by $2^n$, and simplifying yields

$$\frac{uw}{2^n} = 2^{2n} - \left(s' - \frac{t u_0}{2^{n/2}}\right)^2 \tag{2.12}$$

We now show that

$$\frac{uw}{2^n} \geq 2^{2n} - 8u \tag{2.13}$$

by demonstrating that $(s' - t u_0/2^{n/2})^2 \leq 8u$. We note that $s' < u_1 < 2^{n/2}$, which implies
$(s')^2 < 2^{n/2} u_1 \leq u$. Also, since $u_1 t = 2^n - s'$ or $t \leq 2^n/u_1$ we have

$$\left(\frac{t u_0}{2^{n/2}}\right)^2 \leq \left(\frac{2^{n/2} u_0}{u_1}\right)^2 < 2^{n+2} \leq 8u$$

since $u_1 \geq 2^{n/2-1}$, $u_0 < 2^{n/2}$, and $2^{n-1} \leq u$. The desired result follows from the observation
that $(a - b)^2 \leq \max(a^2, b^2)$ for non-negative $a$ and $b$.

Since $r = \lfloor w/2^n \rfloor$, it follows from (2.13) that

$$ur = u \left\lfloor \frac{w}{2^n} \right\rfloor > u\left(\frac{w}{2^n} - 1\right) = \frac{uw}{2^n} - u \geq 2^{2n} - 9u$$

It follows that $r > (2^{2n}/u) - 9$. Also from (2.12), we see that $r \leq 2^{2n}/u$. The three-step
adjustment process at the end of recip$(u, n)$ increases $ur$ by the largest integer multiple of
$u$ less than $16u$ that keeps it less than or equal to $2^{2n}$. That is, $r$ satisfies $ur = 2^{2n} - s$ for
some $0 \leq s < u$, which means that $r$ is the reciprocal of $u$.

The algorithm for recip$(u, n)$ translates into a circuit as follows: a) recip$(u, 1)$ is
realized by an assignment, and b) recip$(u, n)$, $n > 1$, is realized by invoking a circuit for
recip$(\lfloor u/2^{n/2} \rfloor, n/2)$ followed by a circuit for $\lfloor (2^{3n/2+1} t - u t^2)/2^n \rfloor$ and one to implement
the three-step adjustment. The first of these steps computes $\lfloor u/2^{n/2} \rfloor$, which does not require
any gates, merely shifting and discarding bits. The second step requires shifting $t$ left by $3n/2 + 1$
places, computing $t^2$ and multiplying it by $u$, subtracting the result from the shifted version
of $t$, and shifting the final result right by $n$ places and discarding low-order bits. Circuits for
this have size $c M_{\text{int}}(n, c)$ for some constant $c > 0$ and depth $O(\log n)$. The third step can be
done by computing $ur$, adding $u 2^j$ for $j = 3, 2, 1$, or $0$, and comparing the result with $2^{2n}$.
The comparisons control whether $2^j$ is added to $r$ or not. The one multiplication and the
additions can be done with circuits of size $c' M_{\text{int}}(n, c)$ for some constant $c' > 0$ and depth
$O(\log n)$. The comparison operations can be done with a constant additional number of gates
and constant depth. (See Problem 2.19.)

It follows that recip can be realized by a circuit whose size $C_{\text{recip}}(n)$ is no more than a
multiple of the size of an integer multiplication circuit, $M_{\text{int}}(n, c)$, plus the size of a circuit for
the invocation of recip$(\lfloor u/2^{n/2} \rfloor, n/2)$. That is,

$$C_{\text{recip}}(n) \leq C_{\text{recip}}(n/2) + c M_{\text{int}}(n, c)$$
$$C_{\text{recip}}(1) = 1$$

for some constant $c > 0$. This inequality implies the following bound:

$$C_{\text{recip}}(n) \leq c \sum_{j=0}^{\log n} M_{\text{int}}\left(\frac{n}{2^j}, c\right) \leq c M_{\text{int}}(n, c) \sum_{j=0}^{\log n} \frac{1}{2^j} = O(M_{\text{int}}(n, c))$$

which follows since $M_{\text{int}}(dn, c) \leq d M_{\text{int}}(n, c)$ when $d \leq 1$.

The depth $D_{\text{recip}}(n)$ of the circuit produced by this algorithm is at most $c \log n$ plus the
depth $D_{\text{recip}}(n/2)$. Since the circuit has at most $1 + \log_2 n$ stages with a depth of at most
$c \log n$ each, $D_{\text{recip}}(n) \leq 2c \log^2 n$ when $n \geq 2$.

THEOREM 2.10.1 If $n = 2^k$, the reciprocal function $f_{\text{recip}}^{(n)} : \mathcal{B}^n \mapsto \mathcal{B}^{n+2}$ for $n$-bit binary
numbers can be realized by a circuit with the following size and depth:

$$C_\Omega\left(f_{\text{recip}}^{(n)}\right) \leq O(M_{\text{int}}(n, c))$$
$$D_\Omega\left(f_{\text{recip}}^{(n)}\right) \leq c \log_2^2 n$$

VERY FAST RECIPROCAL Beame, Cook, and Hoover [33] have given an $O(\log n)$-depth circuit for
the reciprocal function. It uses a sequence of about $n/\log n$ primes to represent an $n$-bit
binary number $x$, $0.5 \leq x < 1$, using arithmetic modulo these primes. The size of the circuit
produced is polynomial in $n$, although much larger than $M_{\text{int}}(n, c)$. Reif and Tate [325] show
that the reciprocal function can be computed with a circuit that is defined only in terms of $n$
and has a size proportional to $M_{\text{int}}$ (and thus nearly optimal) and depth $O(\log n \log\log n)$.
Although the depth bound is not quite as good as that of Beame, Cook, and Hoover, its size
bound is very good.

2.10.1 Reductions to the Reciprocal 

In this section we show that the reciprocal function contains the squaring function as a sub- 
function. It follows from Problem 2.33 and the preceding result that integer multiplication 
and division have comparable circuit size. We use Taylor's theorem [315, p. 345] to establish 
the desired result. 

THEOREM 2.10.2 (Taylor) Let $f(x) : \mathbb{R} \mapsto \mathbb{R}$ be a continuous real-valued function defined
on the interval $[a, b]$ whose $k$th derivative is also continuous for $k \leq n + 1$ over the same interval.
Then for $a \leq x_0 \leq x \leq b$, $f(x)$ can be expanded as

$$f(x) = f(x_0) + (x - x_0) f^{[1]}(x_0) + \frac{(x - x_0)^2}{2} f^{[2]}(x_0) + \cdots + \frac{(x - x_0)^n}{n!} f^{[n]}(x_0) + r_n$$

where $f^{[n]}$ denotes the $n$th derivative of $f$ and the remainder $r_n$ satisfies

$$r_n = \int_{x_0}^{x} f^{[n+1]}(t) \frac{(x - t)^n}{n!}\, dt = \frac{(x - x_0)^{n+1}}{(n+1)!} f^{[n+1]}(\psi)$$

for some $\psi$ satisfying $x_0 \leq \psi \leq x$.



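Taylor's theorem is used to expand $\lfloor 2^{2n}/|u| \rfloor$ by applying it to the function $f(w) = (1 + w)^{-1}$
on the interval $[0, 1]$. The Taylor expansion of this function is

$$(1 + w)^{-1} = 1 - w + w^2 - w^3 (1 + \psi)^{-4}$$

for some $0 \leq \psi \leq 1$. The magnitude of the last term is at most $w^3$.
Let $n \geq 12$, $k = \lfloor n/2 \rfloor$, $l = \lfloor n/12 \rfloor$ and restrict $|u|$ as follows:

$$|u| = 2^k + |a| \quad \text{where} \quad |a| = 2^l |b| + 1 \quad \text{and} \quad |b| \leq 2^{l-1} - 1$$

It follows that $|a| \leq 2^{2l-1} - 2^l + 1 < 2^{2l-1}$ for $l \geq 1$. Applying the Taylor series expansion
to $(1 + |a|/2^k)^{-1}$, we have

$$\frac{2^{2n-1}}{2^k + |a|} = 2^{2n-1-k} \left(1 - \frac{|a|}{2^k} + \left(\frac{|a|}{2^k}\right)^2 - \left(\frac{|a|}{2^k}\right)^3 (1 + \psi)^{-4}\right) \tag{2.14}$$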



for some $0 \leq \psi \leq 1$. For the given range of values for $|u|$ both the sum of the first two terms
and the third term on the right-hand side have the following bounds:

$$2^{2n-1-k}\left(1 - |a|/2^k\right) > 2^{2n-1-k}\left(1 - 2^{2l-1}/2^k\right)$$

$$2^{2n-1-k}\left(|a|/2^k\right)^2 < 2^{2n-1-k}\left(2^{2l-1}/2^k\right)^2$$

Since $2^{2l-1}/2^k < 1/2$, the value of the third term, $2^{2n-1-k}(|a|/2^k)^2$, is an integer that does
not overlap in any bit positions with the sum of the first two terms.

The fourth term is negative; its magnitude has the following upper bound:

$$2^{2n-1-4k} |a|^3 (1 + \psi)^{-4} < 2^{3(2l-1)+2n-1-4k}$$

Expanding the third term, we have

$$2^{2n-1-3k} (|a|)^2 = 2^{2n-1-3k} \left(2^{2l}|b|^2 + 2^{l+1}|b| + 1\right)$$

Because $3(2l - 1) \leq k$, the third term on the right-hand side of this expansion has value
$2^{2n-1-3k}$ and is larger than the magnitude of the fourth term in (2.14). Consequently the
fourth term does not affect the value of the result in (2.14) in positions occupied by the binary
representation of $2^{2n-1-3k}(2^{2l}|b|^2 + 2^{l+1}|b|)$. In turn, $2^{l+1}|b|$ is less than $2^{2l}$, which means
that the binary representation of $2^{2n-1-3k}(2^{2l}|b|^2)$ appears in the output shifted but otherwise
without modification. This provides the following result.

LEMMA 2.10.1 The reciprocal function $f_{\text{recip}}^{(n)}$ contains as a subfunction the squaring function
$f_{\text{square}}^{(m)}$ for $m = \lfloor n/12 \rfloor - 1$.

Proof The value of the $l$-bit binary number denoted by $b$ appears in the output if
$l = \lfloor n/12 \rfloor \geq 1$. ∎

Lower bounds similar to those derived for the reciprocal function can be derived for special 
fractional powers of binary numbers. (See Problem 2.35.) 
2.11 Symmetric Functions

The symmetric functions are encountered in many applications. Among the important symmetric
functions is binary sorting, the binary version of the standard sorting function. A
surprising fact holds for binary sorting, namely, that it can be realized on $n$ inputs by a circuit
whose size is linear in $n$ (see Problem 2.17), whereas non-binary sorting requires on the
order of $n \log n$ operations. Binary sorting, and all other symmetric functions, can be realized
efficiently through the use of a counting circuit that counts the number of 1's among the $n$
inputs with a circuit of size linear in $n$. The counting circuit uses AND, OR, and NOT. When
negations are disallowed, binary sorting requires on the order of $n \log n$ gates, as shown in
Section 9.6.1.



DEFINITION 2.11.1 A permutation $\pi$ of an $n$-tuple $x = (x_1, x_2, \ldots, x_n)$ is a reordering
$\pi(x) = (x_{\pi(1)}, x_{\pi(2)}, \ldots, x_{\pi(n)})$ of the components of $x$. That is, $\{\pi(1), \pi(2), \ldots, \pi(n)\} =
\{1, 2, 3, \ldots, n\}$. A symmetric function $f^{(n)} : \mathcal{B}^n \mapsto \mathcal{B}^m$ is a function for which $f^{(n)}(x) =
f^{(n)}(\pi(x))$ for all permutations $\pi$. $S_{n,m}$ is the set of all symmetric functions $f^{(n)} : \mathcal{B}^n \mapsto \mathcal{B}^m$
and $S_n = S_{n,1}$ is the set of Boolean symmetric functions on $n$ inputs.

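If $f^{(3)}$ is symmetric, then $f^{(3)}(0, 1, 1) = f^{(3)}(1, 0, 1) = f^{(3)}(1, 1, 0)$.
The following are symmetric functions:

1. Threshold functions $\tau_t^{(n)} : \mathcal{B}^n \mapsto \mathcal{B}$, $1 \leq t \leq n$:

$$\tau_t^{(n)}(x) = \begin{cases} 1 & \sum_{j=1}^{n} x_j \geq t \\ 0 & \text{otherwise} \end{cases}$$

2. Elementary symmetric functions $e_t^{(n)} : \mathcal{B}^n \mapsto \mathcal{B}$, $0 \leq t \leq n$:

$$e_t^{(n)}(x) = \begin{cases} 1 & \sum_{j=1}^{n} x_j = t \\ 0 & \text{otherwise} \end{cases}$$

3. Binary sorting function $f_{\text{sort}}^{(n)} : \mathcal{B}^n \mapsto \mathcal{B}^n$ sorts an $n$-tuple into descending order:

$$f_{\text{sort}}^{(n)}(x) = \left(\tau_1^{(n)}, \tau_2^{(n)}, \ldots, \tau_n^{(n)}\right)$$

Here $\tau_t^{(n)}$ is the $t$th threshold function.

4. Modulus functions $f_{c, \bmod m}^{(n)} : \mathcal{B}^n \mapsto \mathcal{B}$, $0 \leq c \leq m - 1$:

$$f_{c, \bmod m}^{(n)}(x) = \begin{cases} 1 & \sum_{j=1}^{n} x_j = c \bmod m \\ 0 & \text{otherwise} \end{cases}$$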

The elementary symmetric functions $e_t^{(n)}$ are building blocks in terms of which other symmetric
functions can be realized at small additional cost. Each symmetric function $f^{(n)}$ is
determined uniquely by its value $v_t$, $0 \leq t \leq n$, when exactly $t$ of the input variables are 1. It
follows that we can write $f^{(n)}(x)$ as

$$f^{(n)}(x) = \bigvee_{0 \leq t \leq n} v_t \wedge e_t^{(n)}(x) = \bigvee_{t \,|\, v_t = 1} e_t^{(n)}(x) \tag{2.15}$$

Thus, efficient circuits for the elementary symmetric functions yield efficient circuits for general
symmetric functions.
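
Expansion (2.15) can be illustrated in a few lines. The sketch below assumes that a Boolean symmetric function is specified by its value vector $(v_0, \ldots, v_n)$ and that inputs are tuples of 0's and 1's; the function names are hypothetical.

```python
def elementary_symmetric(x, t):
    """e_t(x) = 1 exactly when t of the inputs are 1."""
    return 1 if sum(x) == t else 0


def symmetric_from_values(values):
    """Build f(x) as the OR of e_t(x) over all t with v_t = 1, where values[t] = v_t."""
    def f(x):
        return 1 if any(elementary_symmetric(x, t)
                        for t, v in enumerate(values) if v == 1) else 0
    return f
```

For instance, the value vector `[0, 0, 1, 1]` yields the threshold function $\tau_2^{(3)}$ and `[0, 1, 0, 1]` yields the parity of three inputs.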

An efficient circuit for the elementary symmetric functions can be obtained from a circuit
for counting the number of 1's among the variables $x$. This counting function
$f_{\text{count}}^{(n)} : \mathcal{B}^n \mapsto \mathcal{B}^{\lceil \log_2 (n+1) \rceil}$ produces a $\lceil \log_2 (n+1) \rceil$-bit binary number representing the number of
1's among the $n$ inputs $x_1, x_2, \ldots, x_n$.

A recursive construction for the counting function is shown in Fig. 2.21(b) when $m = 2^{l+1} - 1$.
The $m$ inputs are organized into three groups, the first $2^l - 1$ Boolean variables $u$,
the second $2^l - 1$ variables $v$, and the last variable $x_m$. The sum is represented by $l$ "sum bits"
$s_j^{(l+1)}$, $0 \leq j \leq l - 1$, and the "carry bit" $c_l^{(l+1)}$. This sum is formed by adding in a ripple
adder the outputs $s_j^{(l)}$, $0 \leq j \leq l - 2$, and $c_{l-1}^{(l)}$ from the two counting circuits, each on
$2^l - 1$ inputs, and the $m$th input $x_m$. (We abuse notation and use the same variables for the
outputs of the different counting circuits.) The counting circuit on $2^2 - 1 = 3$ inputs is the
full adder of Fig. 2.21(a). From this construction we have the following lemma:



LEMMA 2.11.1 For $n = 2^k - 1$, $k \geq 2$, the counting function $f_{\text{count}}^{(n)} : \mathcal{B}^n \mapsto \mathcal{B}^{\lceil \log_2 (n+1) \rceil}$
can be realized with the following circuit size and depth over the basis $\Omega = \{\wedge, \vee, \oplus\}$:

$$C_\Omega\left(f_{\text{count}}^{(n)}\right) \leq 5(2^k - k - 1)$$
$$D_\Omega\left(f_{\text{count}}^{(n)}\right) \leq 4k - 5$$

Proof Let $C(k) = C_\Omega(f_{\text{count}}^{(n)})$ and $D(k) = D_\Omega(f_{\text{count}}^{(n)})$ when $n = 2^k - 1$. Clearly,
$C(2) = 5$ and $D(2) = 3$ since a full adder over $\Omega = \{\wedge, \vee, \oplus\}$ has five gates and depth 3.
The following inequality is immediate from the construction:

$$C(k) \leq 2C(k-1) + 5(k-1)$$

The size bound follows immediately. The depth bound requires a more careful analysis.

Shown in Fig. 2.21(a) is a full adder together with notation showing the amount by
which the length of a path from an input to an output is increased in passing through it
when the full-adder circuit used is that shown in Fig. 2.14 and described by Equation 2.6.
From this it follows that

$$D_\Omega\left(c_{j+1}^{(l+1)}\right) = \max\left(D_\Omega\left(c_j^{(l+1)}\right) + 2,\; D_\Omega\left(s_j^{(l)}\right) + 3\right)$$
$$D_\Omega\left(s_j^{(l+1)}\right) = \max\left(D_\Omega\left(c_j^{(l+1)}\right) + 1,\; D_\Omega\left(s_j^{(l)}\right) + 2\right)$$

for $2 \leq l$ and $0 \leq j \leq l - 1$, where $s_{l-1}^{(l)} = c_{l-1}^{(l)}$. It can be shown by induction that
$D_\Omega(c_j^{(k)}) = 2(k + j) - 3$, $1 \leq j \leq k - 1$, and $D_\Omega(s_j^{(k)}) = 2(k + j) - 2$, $0 \leq j \leq k - 2$,
both for $2 \leq k$. (See Problem 2.16.) Thus, $D_\Omega(f_{\text{count}}^{(n)}) = D_\Omega(c_{k-1}^{(k)}) = (4k - 5)$. ∎

Figure 2.21 A recursive construction for the counting function $f_{\text{count}}^{(m)}$, $m = 2^{l+1} - 1$:
(a) a full adder (FA); (b) two counting circuits $f_{\text{count}}^{((m-1)/2)}$ whose outputs, together with $x_m$, feed a ripple adder of full adders.
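
The recursive construction can be simulated to confirm both the count it produces and the number of full adders it uses. The sketch below is an illustration with one assumption beyond the text: the recursion is carried one level further down, to a single input that is passed through as a wire, so that the three-input case comes out as exactly one full adder. The names are hypothetical.

```python
def full_adder(a, b, c):
    """Sum and carry of three bits, five gates over {AND, OR, XOR}."""
    return a ^ b ^ c, (a & b) | ((a ^ b) & c)


def count_ones(bits):
    """Count the 1's among 2^k - 1 input bits.

    Returns (out, adders): the count as a list of bits, least significant
    first, and the number of full adders used.
    """
    m = len(bits)
    if m == 1:
        return [bits[0]], 0                  # a single input needs no gates
    half = (m - 1) // 2
    left, adders_left = count_ones(bits[:half])
    right, adders_right = count_ones(bits[half:2 * half])
    carry = bits[-1]                         # the last input enters as carry-in
    out = []
    for a, b in zip(left, right):            # ripple adder on the two counts
        s, carry = full_adder(a, b, carry)
        out.append(s)
    out.append(carry)
    return out, adders_left + adders_right + len(left)
```

With five gates per full adder, the adder count $2^k - k - 1$ reproduces the size bound $5(2^k - k - 1)$ of the lemma.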

We now use this bound to derive upper bounds on the size and depth of symmetric func- 
tions in the class 5 n _ m . 

THEOREM 2.1 I.I Every symmetric function p n > : B n i— > B m can be realized with the following 
circuit size and depth over the basis £1 = { A, V, ©} where 4>{k) = 5(2 — k — l); 

Cn (/ H ) < m\{n + l)/2] + <f>{k) + 2(n + 1) + (2[lo g2 (n + 1)] - 2)>/2(n+l) 

Ofi f/ H ) < 5[log 2 (n+ 1)1 + riog 2 riog 2 (n+ 1)H -4 



fork = |Tog 2 (n+ 1)1 



even. 



Proof Lemma 2.11.1 establishes bounds on the size and depth of the function $f_{count}^{(n)}$ for
$n = 2^k - 1$. For other values of n, let $k = \lceil \log_2(n+1) \rceil$ and fill out the $2^k - 1 - n$
variables with 0's.

The elementary symmetric functions are obtained by applying the value of $f_{count}^{(n)}$ as
argument to the decoder function. A circuit for this function has been constructed that has
size $2(n+1) + (2\lceil \log_2(n+1) \rceil - 2)\sqrt{2(n+1)}$ and depth $\lceil \log_2 \lceil \log_2(n+1) \rceil \rceil + 1$.
(See Lemma 2.5.4. We use the fact that $2^{\lceil \log_2 m \rceil} \le 2m$.) Thus, all elementary symmetric
functions on n variables can be realized with the following circuit size and depth:

$$C_\Omega(e_0^{(n)}, e_1^{(n)}, \ldots, e_n^{(n)}) \le \phi(k) + 2(n+1) + (2\lceil \log_2(n+1) \rceil - 2)\sqrt{2(n+1)}$$

$$D_\Omega(e_0^{(n)}, e_1^{(n)}, \ldots, e_n^{(n)}) \le 4k - 5 + \lceil \log_2 \lceil \log_2(n+1) \rceil \rceil + 1$$

The expansion of Equation (2.15) can be used to realize an arbitrary Boolean symmetric
function. Clearly, at most n OR gates and depth $\lceil \log_2 n \rceil$ suffice to realize each one of m
arbitrary Boolean symmetric functions. (Since the $v_t$ are fixed, no ANDs are needed.) This
number of ORs can be reduced to (n − 1)/2 as follows: if $\lceil (n+1)/2 \rceil$ or more elementary
functions are needed, use the complementary set (of at most $\lfloor (n+1)/2 \rfloor$ functions) and
take the complement of the result. Thus, no more than $\lceil (n+1)/2 \rceil - 1$ ORs are needed per
symmetric function (plus possibly one NOT), and depth at most $\lceil \log_2 \lfloor (n+1)/2 \rfloor \rceil + 1
\le \lceil \log_2(n+1) \rceil$. ■
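The construction in this proof can be illustrated with a small functional sketch. It is not a gate-level circuit: the counting function, the decoder, and the final OR are modeled as ordinary Python functions, and the names `count_ones`, `elementary_symmetric`, and `symmetric_function` are chosen here for illustration only. The value vector `v` is assumed to give the function's value on inputs containing t ones.

```python
import math
from itertools import product

def count_ones(x):
    """Model of the counting function: the number of 1s among the inputs."""
    return sum(x)

def elementary_symmetric(x):
    """Model of the decoder applied to the count: e[t] = 1 exactly when t inputs are 1."""
    c = count_ones(x)
    return [1 if t == c else 0 for t in range(len(x) + 1)]

def symmetric_function(v, x):
    """Evaluate the symmetric function whose value on inputs with t ones is v[t]."""
    n = len(x)
    e = elementary_symmetric(x)
    selected = [t for t in range(n + 1) if v[t] == 1]
    if len(selected) < math.ceil((n + 1) / 2):
        # Few elementary functions are needed: OR them directly.
        return int(any(e[t] for t in selected))
    # Otherwise OR the complementary set and complement the result.
    unselected = [t for t in range(n + 1) if v[t] == 0]
    return 1 - int(any(e[t] for t in unselected))

# Parity on four variables as a symmetric function.
parity = [t % 2 for t in range(5)]
print([symmetric_function(parity, x) for x in product((0, 1), repeat=4)][:4])
```

The branch on the number of selected elementary functions mirrors the complement trick that brings the OR count per output down to at most $\lceil (n+1)/2 \rceil - 1$.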
This theorem establishes that the binary sorting function $f_{sort}^{(n)} : B^n \mapsto B^n$ has size $O(n^2)$. In fact,
a linear-size circuit can be constructed for it, as stated in Problem 2.17.

2.12 Most Boolean Functions Are Complex 

As we show in this section, the circuit size and depth of most Boolean functions $f : B^n \mapsto B$
on n variables are at least exponential and linear in n, respectively. Furthermore, we show in 
Section 2.13 that such functions can be realized with circuits whose size and depth are at most 
exponential and linear, respectively, in n. Thus, the circuit size and depth of most Boolean 
functions on n variables are tightly bounded. Unfortunately, this result says nothing about the 
size and depth of a specific function, the case of most interest. 

Each Boolean function on n variables is represented by a table with $2^n$ rows and one
column of values for the function. Since each entry in this one column can be completed in
one of two ways, there are $2^{2^n}$ ways to fill in the column. Thus, there are exactly $2^{2^n}$ Boolean
functions on n variables. Most of these functions cannot be realized by small circuits because
there just are not enough small circuits.
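The count of $2^{2^n}$ functions can be checked by brute force for very small n. The sketch below simply enumerates every way to fill the single output column of the table; the helper name `all_boolean_functions` is an assumption made for this illustration.

```python
from itertools import product

def all_boolean_functions(n):
    """Yield every Boolean function on n variables as a dict from input tuples to 0/1."""
    rows = list(product((0, 1), repeat=n))            # the 2^n rows of the table
    for column in product((0, 1), repeat=len(rows)):  # each way to fill the column
        yield dict(zip(rows, column))

# Count the functions for n = 0, 1, 2, 3 and compare with 2^(2^n).
counts = {n: sum(1 for _ in all_boolean_functions(n)) for n in range(4)}
print(counts)
```

The enumeration already becomes impractical at n = 5, where there are $2^{32}$ functions, which gives a feel for why small circuits cannot cover them all.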

THEOREM 2.12.1 Let $0 < \epsilon < 1$. The fraction of the Boolean functions $f : B^n \mapsto B$ that
have size complexity $C_{\Omega_0}(f)$ satisfying the following lower bound is at least $1 - 2^{-(\epsilon/2)2^n}$ when
$n \ge 2[(1-\epsilon)/\epsilon] \log_2[(3e)^2(1-\epsilon/2)]$. (Here e = 2.71828... is the base of the natural
logarithm.)

$$C_{\Omega_0}(f) \ge \frac{2^n}{n}(1-\epsilon) - 2n^2$$

Proof Each circuit contains some number, say g, of gates and each gate can be one of the
three types of gate in the standard basis. The circuit with no gates computes the constant
functions with value of 1 or 0 on all inputs.

An input to a gate can either be the output of another gate or one of the n input variables.
(Since the basis $\Omega_0$ is {AND, OR, NOT}, no gate need have a constant input.) Since each
gate has at most two inputs, there are at most $(g-1+n)^2$ ways to connect inputs to one
gate and $(g-1+n)^{2g}$ ways to interconnect g gates. In addition, since each gate can be
one of three types, there are $3^g$ ways to name the gates. Since there are g! orderings of
g items (gates) and the ordering of gates does not change the function they compute, at
most $N(g) = 3^g (g+n)^{2g}/g!$ distinct functions can be realized with g gates. Also, since
$g! \ge g^g e^{-g}$ (see Problem 2.2) it follows that

$$N(g) \le (3e)^g [(g^2 + 2gn + n^2)/g]^g \le (3e)^g (g + 2n^2)^g$$

The last inequality follows because $2gn + n^2 \le 2gn^2$ for $n \ge 2$. Since the last bound is an
increasing function of g, N(0) = 2 and $G + 1 \le (3e)^G$ for $G \ge 1$, the number M(G) of
functions realizable with between 0 and G gates satisfies

$$M(G) \le (G+1)(3e)^G (G + 2n^2)^G \le [(3e)^2 (G + 2n^2)]^G \le (x^x)^{1/a}$$

where $x = a(G + 2n^2)$ and $a = (3e)^2$. With base-2 logarithms, it is straightforward to
show that $x^x \le 2^{x_0}$ if $x \le x_0/\log_2 x_0$ and $x_0 \ge 2$.

If $M(G) \le 2^{(1-\delta)2^n}$ for $0 < \delta < 1$, at most a fraction $2^{(1-\delta)2^n}/2^{2^n} = 2^{-\delta 2^n}$ of the
Boolean functions on n variables have circuits with G or fewer gates.

Let $G < 2^n(1-\epsilon)/n - 2n^2$. Then $x = a(G + 2n^2) \le a2^n(1-\epsilon)/n \le x_0/\log_2 x_0$
for $x_0 = a2^n(1-\epsilon/2)$ when $n \ge 2[(1-\epsilon)/\epsilon] \log_2[(3e)^2(1-\epsilon/2)]$, as can be shown
directly. It follows that $M(G) \le (x^x)^{1/a} \le 2^{x_0/a} = 2^{2^n(1-\epsilon/2)}$. ■
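The counting argument can be checked numerically. The sketch below, which assumes $\epsilon = 1/2$ purely for illustration, evaluates the base-2 logarithm of the bound $[(3e)^2(G + 2n^2)]^G$ on M(G) and compares it with the target exponent $2^n(1-\epsilon/2)$; the function names are hypothetical.

```python
import math

def log2_bound_on_functions(G, n):
    """log2 of [(3e)^2 (G + 2n^2)]^G, the bound on M(G) used in the proof."""
    a = (3 * math.e) ** 2
    return G * math.log2(a * (G + 2 * n * n))

def check(n, eps):
    """Return (G, log2 of the bound on M(G), target exponent 2^n (1 - eps/2))."""
    # Largest integer G strictly below 2^n (1 - eps)/n - 2n^2.
    G = math.ceil(2 ** n * (1 - eps) / n - 2 * n * n) - 1
    return G, log2_bound_on_functions(G, n), 2 ** n * (1 - eps / 2)

for n in (14, 16, 20, 24):
    G, log_bound, target = check(n, 0.5)
    print(n, G, round(log_bound), round(target), log_bound <= target)
```

For these values of n the logarithm of the bound stays below the target, so at most a fraction $2^{-(\epsilon/2)2^n}$ of the functions can have circuits with G or fewer gates, as the theorem states.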

To show that most Boolean functions $f : B^n \mapsto B$ over the basis $\Omega_0$ require circuits with
a depth linear in n, we use a similar argument. We first show that for every circuit there is a 
tree circuit (a circuit in which either zero or one edge is directed away from each gate) that 
computes the same function and has the same depth. Thus when searching for small-depth 
circuits it suffices to look only at tree circuits. We then obtain an upper bound on the number 
of tree circuits of depth d or less and show that unless d is linear in n, most Boolean functions 
on n variables cannot be realized with this depth. 

LEMMA 2.12.1 Given a circuit for a function $f : B^n \mapsto B^m$, a tree circuit can be constructed of
the same depth that computes f.

Proof Convert a circuit to a tree circuit without changing its depth as follows: find a vertex
v with out-degree 2 or more at maximal distance from an output vertex. Attach a copy of the
tree subcircuit with output vertex v to each of the edges directed away from v. This reduces
by 1 the number of vertices with out-degree greater than 1 but doesn't change the depth or
function computed. Repeat this process on the new circuit until no vertices of out-degree
greater than 1 remain. ■
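The effect of the lemma can be seen on a small example. The sketch below does not follow the iterative procedure of the proof step by step; instead it unfolds the circuit recursively from its output, which yields the same tree. The circuit encoding (a dict from gate names to an operation and a list of inputs) and all names are assumptions made for this illustration.

```python
from itertools import product

# A circuit is a dict: gate name -> (operation, list of input names).
# Names that are not keys of the dict are input variables.
CIRCUIT = {
    "g1": ("AND", ["x1", "x2"]),
    "g2": ("NOT", ["g1"]),
    "g3": ("OR", ["g1", "x3"]),
    "g4": ("AND", ["g2", "g3"]),  # g1 feeds both g2 and g3 (out-degree 2)
}

OPS = {"AND": lambda a, b: a & b, "OR": lambda a, b: a | b, "NOT": lambda a: 1 - a}

def unfold(circuit, name):
    """Return the tree (nested tuples) rooted at `name`; shared gates are copied."""
    if name not in circuit:
        return name
    op, inputs = circuit[name]
    return (op,) + tuple(unfold(circuit, i) for i in inputs)

def circuit_depth(circuit, name):
    """Length in gates of the longest path ending at `name` in the circuit."""
    if name not in circuit:
        return 0
    return 1 + max(circuit_depth(circuit, i) for i in circuit[name][1])

def circuit_eval(circuit, name, env):
    """Value of `name` in the circuit under the variable assignment `env`."""
    if name not in circuit:
        return env[name]
    op, inputs = circuit[name]
    return OPS[op](*(circuit_eval(circuit, i, env) for i in inputs))

def tree_depth(tree):
    """Depth in gates of a tree produced by `unfold`."""
    if isinstance(tree, str):
        return 0
    return 1 + max(tree_depth(t) for t in tree[1:])

def tree_eval(tree, env):
    """Value of a tree produced by `unfold` under the assignment `env`."""
    if isinstance(tree, str):
        return env[tree]
    return OPS[tree[0]](*(tree_eval(t, env) for t in tree[1:]))

TREE = unfold(CIRCUIT, "g4")
print(TREE, circuit_depth(CIRCUIT, "g4"), tree_depth(TREE))
```

The tree contains two copies of the subcircuit for `g1`, so its size grows, but its depth and the function it computes are unchanged.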

We count the number of tree circuits of depth d as follows. First, we determine T(d), the 
number of binary, unlabeled, and unoriented trees of depth d. (The root has two descendants 
as does every other vertex except for leaves which have none. No vertex carries a label and we 
count as one tree those trees that differ only by the exchange of the two subtrees at a vertex.) 
We then multiply T(d) by the number of ways to label the internal vertices with one of at
most three gates and the leaves with one of the n variables or the constants to obtain an upper
bound on N(d), the number of distinct tree circuits of depth d. Since a tree of depth d has at
most $2^d - 1$ internal vertices and $2^d$ leaves (see Problem 2.3), $N(d) \le T(d)\,3^{2^d}(n+2)^{2^d}$.

LEMMA 2.12.2 When $d \ge 4$ the number T(d) of depth-d unlabeled, unoriented binary trees
satisfies $T(d) \le (56)^{2^{d-4}}$.

Proof There is one binary tree of depth 0, a tree containing a single vertex, and one of
depth 1. Let C(d) be the number of unlabeled, unoriented binary trees of depth d or less,
including depth 0. Thus, C(0) = 1, T(1) = 1, and C(1) = 2. This recurrence for C(d)
follows immediately for $d \ge 1$:

$$C(d) = C(d-1) + T(d) \qquad (2.16)$$

We now enumerate the unoriented, unlabeled binary trees of depth d + 1. Without loss of
generality, let the left subtree of the root have depth d. There are T(d) such subtrees. The
right subtree can either be of depth d − 1 or less (there are C(d − 1) such trees) or of depth
d. In the first case there are T(d)C(d − 1) trees. In the second, there are T(d)(T(d) − 1)/2
pairs of different subtrees (orientation is not counted) and T(d) pairs of identical subtrees.
It follows that

$$T(d+1) = T(d)C(d-1) + T(d)(T(d)-1)/2 + T(d) \qquad (2.17)$$

Thus, T(2) = 2, C(2) = 4, T(3) = 7, C(3) = 11, and T(4) = 56. From this recurrence
we conclude that $T(d+1) \ge T^2(d)/2$. We use this fact and the inequality $y \ge 1/(1-1/y)$,
which holds for $y \ge 2$, to show that $(T(d+1)/T(d)) + T(d)/2 \le T(d+1)/2$. Since
$T(d) \ge 4$ for $d \ge 3$, it follows that $T(d)/2 \ge 1/(1 - 2/T(d))$. Replacing T(d)/2 by this
lower bound in the inequality $T(d+1) \ge T^2(d)/2$, we achieve the desired result by simple
algebraic manipulation. We use this fact below.

Solving the equation (2.17) for C(d − 1), we have

$$C(d-1) = \frac{T(d+1)}{T(d)} - \frac{T(d)+1}{2}$$

Substituting this expression into (2.16) yields the following recurrence:

$$\frac{T(d+2)}{T(d+1)} = \frac{T(d+1)}{T(d)} + \frac{T(d+1) + T(d)}{2}$$

Since $(T(d+1)/T(d)) + T(d)/2 \le T(d+1)/2$, it follows that T(d + 2) satisfies the
inequality $T(d+2) \le T^2(d+1)$ when $d \ge 3$ or $T(d) \le T^2(d-1)$ when $d \ge 5$ and
$d - 1 \ge 4$. Thus, $T(d) \le T^{2^j}(d-j)$ for $d - j \ge 4$ or $T(d) \le (56)^{2^{d-4}}$ for $d \ge 4$. ■
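The recurrences (2.16) and (2.17) are easy to iterate, which gives a direct check of the small values quoted in the proof and of the bound of Lemma 2.12.2. This is a minimal sketch; the function name `tree_counts` is an assumption, and T(0) = 1 is taken from the statement that there is one tree of depth 0.

```python
def tree_counts(max_depth):
    """Return lists T and C with T[d], C[d] for 0 <= d <= max_depth."""
    T = [1, 1]  # T(0), T(1)
    C = [1, 2]  # C(0), C(1)
    for d in range(1, max_depth):
        # Equation (2.17): shallower right subtree, two different depth-d
        # subtrees, or two identical depth-d subtrees.
        t_next = T[d] * C[d - 1] + T[d] * (T[d] - 1) // 2 + T[d]
        T.append(t_next)
        # Equation (2.16) applied at depth d + 1.
        C.append(C[d] + t_next)
    return T, C

T, C = tree_counts(8)
print(T[:6], C[:5])
for d in range(4, 9):
    print(d, T[d] <= 56 ** (2 ** (d - 4)))
```

Python integers are unbounded, so the doubly exponential growth of T(d) causes no overflow for the depths used here.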

Combine this with the earlier upper bound on N(d) for the number of tree circuits over $\Omega_0$
of depth d and we have that $N(d) \le c^{2^d}$ for $d \ge 4$, where $c = 3((56)^{1/16})(n+2)$. (Note that
$3(56)^{1/16} \le 4$.) The number of such trees of depth 0 through d is at most $N(d+1) \le c^{2^{d+1}}$.
But if $c^{2^{D+1}}$ is at most $2^{2^n(1-\delta)}$, then a fraction of at most $2^{-\delta 2^n}$ of the Boolean functions on
n variables have depth D or less. But this holds when

$$D = n - 1 - \delta \log_2 e - \log_2 \log_2 4(n+2) = n - \log \log n - O(1)$$

since $\ln(1-x) \le -x$. Note that $d \ge 4$ implies that $n \ge d + 1$.

THEOREM 2.12.2 For each $0 < \delta < 1$ a fraction of at least $1 - 2^{-\delta 2^n}$ of the Boolean functions
$f : B^n \mapsto B$ have depth complexity $D_{\Omega_0}(f)$ that satisfies the following bound when $n \ge 5$:

$$D_{\Omega_0}(f) \ge n - \log \log n - O(1)$$

As the above two theorems demonstrate, most Boolean functions on n variables require
circuits whose size and depth are approximately $2^n/n$ and n, respectively. Fortunately, most
of the useful Boolean functions are far less complex than these bounds suggest. In fact, we
often encounter functions whose size is polynomial in n and whose depth is logarithmic in or
a small polynomial in the logarithm of the size of its input. Functions that are polynomial in
the logarithm of n are called poly-logarithmic.



2.13 Upper Bounds on Circuit Size 



In this section we demonstrate that every Boolean function on n variables can be realized with 
circuit size and depth that are close to the lower bounds derived in the preceding section. 
We begin by stating the obvious upper bounds on size and depth and then proceed to obtain 
stronger (that is, smaller) upper bounds on size through the use of refined arguments. 
As shown in Section 2.2.2, every Boolean function $f : B^n \mapsto B$ can be realized as the OR
of its minterms. As shown in Section 2.5.4, the minterms on n variables are produced by the
decoder function $f_{decode}^{(n)} : B^n \mapsto B^{2^n}$, which has a circuit with $2^n + (2n-2)2^{n/2}$ gates and
depth $\lceil \log_2 n \rceil + 1$. Consequently, we can realize f from a circuit for $f_{decode}^{(n)}$ and an OR tree
on at most $2^n$ inputs (which has at most $2^n - 1$ two-input ORs and depth at most n). We
have that every function $f : B^n \mapsto B$ has circuit size and depth satisfying:

$$C_{\Omega_0}(f) \le C_{\Omega_0}(f_{decode}^{(n)}) + 2^n - 1 \le 2^{n+1} + (2n-2)2^{n/2}$$

$$D_{\Omega_0}(f) \le D_{\Omega_0}(f_{decode}^{(n)}) + n \le n + \lceil \log_2 n \rceil + 1$$

Thus every Boolean function $f : B^n \mapsto B$ can be realized with an exponential number of
gates and depth $n + \lceil \log_2 n \rceil + 1$. Since the depth lower bound of n − O(log log n) applies to
almost all Boolean functions on n variables (see Section 2.12), this is a very good upper bound
on depth. We improve upon the circuit size bound after summarizing the depth bound.

THEOREM 2.13.1 The depth complexity of every Boolean function $f : B^n \mapsto B$ satisfies the
following bound:

$$D_{\Omega_0}(f) \le n + \lceil \log_2 n \rceil + 1$$
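The minterm construction behind these bounds can be sketched functionally. The code below models the decoder as a table of minterm values and the OR tree as a single `any`, so it illustrates the decomposition rather than an actual gate network. The helper names are assumptions, and `size_depth_bounds` evaluates the size formula only for even n, where $2^{n/2}$ is an integer.

```python
import math
from itertools import product

def decoder(x):
    """All 2^n minterms of x, indexed by the assignment on which each one is 1."""
    return {a: int(tuple(x) == a) for a in product((0, 1), repeat=len(x))}

def eval_by_minterms(table, x):
    """Evaluate f (given as a truth table) as the OR of its minterms."""
    minterms = decoder(x)
    return int(any(minterms[a] for a, value in table.items() if value == 1))

def size_depth_bounds(n):
    """Size and depth bounds stated above, for even n."""
    assert n % 2 == 0, "the size formula is evaluated here only for even n"
    size = 2 ** (n + 1) + (2 * n - 2) * 2 ** (n // 2)
    depth = n + math.ceil(math.log2(n)) + 1
    return size, depth

# Majority of three variables as an example truth table.
majority = {a: int(sum(a) >= 2) for a in product((0, 1), repeat=3)}
print([eval_by_minterms(majority, a) for a in product((0, 1), repeat=3)])
print(size_depth_bounds(4))
```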

We now describe a procedure to construct circuits of small size for arbitrary Boolean functions
on n variables. By the results of the preceding section, this size will be exponential in n.
The method of approach is to view an arbitrary Boolean function $f : B^n \mapsto B$ on n input
variables x as a function of two sets of variables, a, the first k variables of x, and b, the remaining
n − k variables of x. That is, x = ab where $a = (x_1, \ldots, x_k)$ and $b = (x_{k+1}, \ldots, x_n)$.

As suggested by Fig. 2.22, we rearrange the entries in the defining table for f into a rectangular
table with $2^k$ rows indexed by a and $2^{n-k}$ columns indexed by b. The lower right-hand
quadrant of the table contains the values of the function f. The value of f on x is the entry
at the intersection of the row indexed by the value of a and the column indexed by the value
of b. We fix s and divide the lower right-hand quadrant of the table into p − 1 groups of s
consecutive rows and one group of $s' \le s$ consecutive rows where $p = \lceil 2^k/s \rceil$. (Note that
$(p-1)s + s' = 2^k$.) Call the ith collection of rows $A_i$. This table serves as the basis for the
(k, s)-Lupanov representation of f, from which a smaller circuit for f can be constructed.

Let $f_i : B^n \mapsto B$ be f restricted to $A_i$; that is,

$$f_i(x) = \begin{cases} f(x) & \text{if } a \in A_i \\ 0 & \text{otherwise.} \end{cases}$$

It follows that f can be expanded as the OR of the $f_i$.

Figure 2.22 The rectangular representation of the defining table of a Boolean function used in
its (k, s)-Lupanov representation.

We now expand $f_i$. When b is fixed, the values for $f_i(ab)$ when $a \in A_i$ constitute an
s-tuple ($s'$-tuple) v for $1 \le i \le p-1$ (for i = p). Let $B_{i,v}$ be those (n − k)-tuples b for
which v is the tuple of values of $f_i$ when $a \in A_i$. (Note that the non-empty sets $B_{i,v}$ for
different values of v are disjoint.) Let $f^{(c)}_{i,v}(b) : B^{n-k} \mapsto B$ be defined as

$$f^{(c)}_{i,v}(b) = \begin{cases} 1 & \text{if } b \in B_{i,v} \\ 0 & \text{otherwise.} \end{cases}$$

Finally, we let $f^{(r)}_{i,v}(a) : B^k \mapsto B$ be the function that has value $v_j$, the jth component of v,
when a is the jth k-tuple in $A_i$:

$$f^{(r)}_{i,v}(a) = \begin{cases} 1 & \text{if } a \text{ is the } j\text{th element of } A_i \text{ and } v_j = 1 \\ 0 & \text{otherwise.} \end{cases}$$

It follows that $f_i(x) = \bigvee_v f^{(r)}_{i,v}(a) f^{(c)}_{i,v}(b)$. Given these definitions, f can be expanded in
the following (k, s)-Lupanov representation:

$$f(x) = \bigvee_{i=1}^{p} \bigvee_{v} f^{(r)}_{i,v}(a) \wedge f^{(c)}_{i,v}(b) \qquad (2.19)$$
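The (k, s)-Lupanov representation can be checked on a small random function. The sketch below evaluates the right-hand side of (2.19) directly from the definitions of the row groups $A_i$, the sets $B_{i,v}$, and the two families of functions; it is a table-level illustration under these definitions, not a circuit. The names `lupanov_groups` and `lupanov_eval`, and the choice n = 5, k = 2, s = 3, are assumptions made for the example.

```python
import random
from itertools import product

def lupanov_groups(k, s):
    """Split the 2^k row labels a into consecutive groups A_1..A_p of at most s rows."""
    rows = list(product((0, 1), repeat=k))
    return [rows[i:i + s] for i in range(0, len(rows), s)]

def lupanov_eval(table, n, k, s, x):
    """Evaluate f(x) through the expansion (2.19); `table` maps n-tuples to 0/1."""
    a, b = tuple(x[:k]), tuple(x[k:])
    columns = list(product((0, 1), repeat=n - k))
    result = 0
    for group in lupanov_groups(k, s):            # the row group A_i
        sets_b = {}                               # v -> B_{i,v}
        for col in columns:
            v = tuple(table[row + col] for row in group)
            sets_b.setdefault(v, set()).add(col)
        for v, b_iv in sets_b.items():
            # f^(r)_{i,v}(a): a is the jth element of A_i and v_j = 1.
            f_r = int(any(a == row and vj == 1 for row, vj in zip(group, v)))
            # f^(c)_{i,v}(b): b belongs to B_{i,v}.
            f_c = int(b in b_iv)
            result |= f_r & f_c
    return result

# A random function on n = 5 variables with a fixed seed for repeatability.
rng = random.Random(0)
N, K, S = 5, 2, 3
TABLE = {x: rng.randint(0, 1) for x in product((0, 1), repeat=N)}
print(all(lupanov_eval(TABLE, N, K, S, x) == TABLE[x] for x in TABLE))
```

With k = 2 and s = 3 there are $p = \lceil 4/3 \rceil = 2$ groups, the second containing a single row, which exercises the shorter last group of $s'$ rows.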



We now bound the number of logic elements needed to realize an arbitrary function $f : B^n \mapsto B$
in this representation.

Consider the functions $f^{(r)}_{i,v}(a)$ for a fixed value of v. We construct a decoder circuit for
the minterms in a that has size at most $2^k + (k-2)2^{k/2}$. Each of the functions $f^{(r)}_{i,v}$ can be
realized as the OR of s minterms in a for $1 \le i \le p-1$ and $s'$ minterms otherwise. Thus,
$(p-1)(s-1) + (s'-1) \le 2^k$ two-input ORs suffice for all values of i and a fixed value of v.
Hence, for each value of v the functions $f^{(r)}_{i,v}$ can be realized by a circuit of size $O(2^k)$. Since
there are at most $2^s$ choices for v, all $f^{(r)}_{i,v}$ can be realized by a circuit of size $O(2^{k+s})$.

Consider next the functions $f^{(c)}_{i,v}(b)$. We construct a decoder circuit for the minterms of
b that has size at most $2^{n-k} + (n-k-2)2^{(n-k)/2}$. Since for each i, $1 \le i \le p$, the sets
$B_{i,v}$ for different values of v are disjoint, $f^{(c)}_{i,v}(b)$ can be realized as the OR of at most $2^{n-k}$
minterms using at most $2^{n-k}$ two-input ORs. Thus, all $f^{(c)}_{i,v}(b)$, $1 \le i \le p$, can be realized
with $p2^{n-k} + 2^{n-k} + (n-k-2)2^{(n-k)/2}$ gates.

Consulting (2.19), we see that to realize f we must add one AND gate for each i and tuple
v. We must also add the number of two-input OR gates needed to combine these products.
Since there are at most $p2^s$ products, at most $p2^s$ OR gates are needed for a total of $p2^{s+1}$
gates.

Let $C_{k,s}(f)$ be the total number of gates needed to realize f in the (k, s)-Lupanov
representation. $C_{k,s}(f)$ satisfies the following inequality:

$$C_{k,s}(f) \le O(2^{k+s}) + O(2^{n-k}) + p(2^{n-k} + 2^{s+1})$$

Since $p = \lceil 2^k/s \rceil$, $p \le 2^k/s + 1$, this expands to

$$C_{k,s}(f) \le O(2^{k+s}) + O(2^{n-k}) + \frac{2^n}{s} + \frac{2^{k+s+1}}{s}$$

Now let $k = \lceil 3 \log_2 n \rceil$ and $s = \lceil n - 5 \log_2 n \rceil$. Then, $k + s \le n - \log_2 n^2 + 2$ and
$n - k \le n - \log_2 n^3$. As a consequence, for large n, we have

$$C_{k,s}(f) \le O\left(\frac{2^n}{n^2}\right) + O\left(\frac{2^n}{n^3}\right) + \frac{2^n}{(n - 5 \log_2 n)}$$

We summarize the result in a theorem. 

THEOREM 2.13.2 For each $\epsilon > 0$ there exists some $N_0 > 1$ such that for all $n \ge N_0$ every
Boolean function $f : B^n \mapsto B$ has a circuit size complexity satisfying the following upper bound:

$$C_{\Omega_0}(f) \le \frac{2^n}{n}(1 + \epsilon)$$
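The convergence behind this theorem can be observed numerically. The sketch below evaluates only the dominant term $p(2^{n-k} + 2^{s+1})$ of the gate count for $k = \lceil 3 \log_2 n \rceil$ and $s = \lceil n - 5 \log_2 n \rceil$ and divides it by $2^n/n$; the lower-order terms $O(2^{k+s})$ and $O(2^{n-k})$ are deliberately ignored, and the function names are assumptions made for this illustration. Exact integer arithmetic is used because $2^n$ is far outside floating-point range.

```python
import math
from fractions import Fraction

def lupanov_parameters(n):
    """Return (k, s, p) for the parameter choice used in the text."""
    k = math.ceil(3 * math.log2(n))
    s = math.ceil(n - 5 * math.log2(n))
    p = -(-2 ** k // s)  # ceil(2^k / s) with integer arithmetic
    return k, s, p

def dominant_ratio(n):
    """Dominant term p(2^(n-k) + 2^(s+1)) divided by 2^n / n."""
    k, s, p = lupanov_parameters(n)
    dominant = p * (2 ** (n - k) + 2 ** (s + 1))
    return float(Fraction(dominant * n, 2 ** n))

for n in (64, 1024, 65536):
    print(n, lupanov_parameters(n)[:2], dominant_ratio(n))
```

The ratio is roughly n/s, so it approaches 1 only slowly, which is consistent with the theorem holding for all n beyond some threshold $N_0$ that depends on $\epsilon$.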

Since we show in Section 2.12 that for $0 < \epsilon < 1$ almost all Boolean functions $f : B^n \mapsto B$
have a circuit size complexity satisfying

$$C_{\Omega_0}(f) \ge \frac{2^n}{n}(1-\epsilon) - 2n^2$$

for $n \ge 2[(1-\epsilon)/\epsilon] \log_2[(3e)^2(1-\epsilon/2)]$, this is a good lower bound.

Problems 

MATHEMATICAL PRELIMINARIES 

2.1 Show that the following identities on geometric series hold: 

ft (-') 

S 
Figure 2.23 The natural logarithm of the factorial n! is $\sum_{k=1}^{n} \ln k$, which is bounded below
by $\int_1^n \ln x \, dx$ and above by $\int_1^n \ln(x+1) \, dx$.



2.2 Derive tight upper and lower bounds on the factorial function $n! = n(n-1) \cdots 3 \cdot 2 \cdot 1$.

Hint: Derive bounds on In n! where In is the natural logarithm. Use the information 
given in Fig. 2.23. 

2.3 Let T(d) be a complete balanced binary tree of depth d. T(1), shown in Fig. 2.24(a),
has a root and two leaves. T(d) is obtained by attaching to each of the leaves of T(1)
copies of T(d − 1). T(3) is shown in Fig. 2.24(b).

a) Show by induction that T(d) has $2^d$ leaves and $2^d - 1$ non-leaf vertices.

b) Show that any binary tree (each vertex except leaves has two descendants) with n
leaves has n − 1 non-leaf vertices and depth at least $\lceil \log_2 n \rceil$.





(a) (b) 

Figure 2.24 Complete balanced binary trees a) of depth one and b) depth 3. 
BINARY FUNCTIONS AND LOGIC CIRCUITS

2.4 a) Write a procedure EXOR in a language of your choice that writes the description 

of the straight-line program given in equation (2.2). 
b) Write a program in a language of your choice that evaluates an arbitrary straight- 
line program given in the format of equation (2.2) in which each input value is 
specified. 

2.5 A set of Boolean functions forms a complete basis $\Omega$ if a logic circuit can be constructed
for every Boolean function $f : B^n \mapsto B$ using just functions in $\Omega$.

a) Show that the basis consisting of one function, the NAND gate, a gate on two 
inputs realizing the NOT of the AND of its inputs, is complete. 

b) Determine whether or not the basis {AND, OR} is complete. 

2.6 Show that the CNF of a Boolean function f is unique and is the negation of the DNF
of $\bar{f}$.

2.7 Show that the RSE of a Boolean function is unique. 

2.8 Show that any SOPE (POSE) of the parity function $f_\oplus^{(n)}$ has exponentially many terms.
Hint: Show by contradiction that every term in a SOPE (every clause of a POSE)
of $f_\oplus^{(n)}$ contains every variable. Then use the fact that the DNF (CNF) of $f_\oplus^{(n)}$ has
exponentially many terms to complete the proof.

2.9 Demonstrate that the RSE of the OR of n variables, $f_\vee^{(n)}$, includes every product term
except for the constant 1.

2.10 Consider the Boolean function $f_{\bmod 3}^{(n)}$ on n variables, which has value 1 when the sum
of its variables is zero modulo 3 and value 0 otherwise. Show that it has exponential-size
DNF, CNF, and RSE normal forms.

Hint: Use the fact that the following sum is even: 

0<j<k v J 

2.11 Show that every Boolean function $f^{(n)} : B^n \mapsto B$ can be expanded as follows:

$$f(x_1, x_2, \ldots, x_n) = x_1 f(1, x_2, \ldots, x_n) \vee \bar{x}_1 f(0, x_2, \ldots, x_n)$$

Apply this expansion to each variable of f(x\, x 2 , X3) = x{x 2 V x 2 x^ to obtain its 
DNF. 

2.12 In a dual-rail logic circuit 0 and 1 are represented by the pairs (0, 1) and (1, 0),
respectively. A variable x is represented by the pair $(x, \bar{x})$. A NOT in this representation
(called a DRL-NOT) is a pair of twisted wires.

a) How are AND (DRL-AND) and OR (DRL-OR) realized in this representation? Use
standard AND and OR gates to construct circuits for gates in the new representation.
Show that every function $f : B^n \mapsto B^m$ can be realized by a dual-rail logic
circuit in which the standard NOT gates are used only on input variables (to obtain
the pair $(x, \bar{x})$).

b) Show that the size and depth of a dual-rail logic circuit for a function $f : B^n \mapsto B$
are at most twice the circuit size (plus the NOTs for the inputs) and at most one
more than the circuit depth of f over the basis {AND, OR, NOT}, respectively.

2.13 A function $f : B^n \mapsto B$ is monotone if for all $1 \le j \le n$,
$f(x_1, \ldots, x_{j-1}, 0, x_{j+1}, \ldots, x_n) \le f(x_1, \ldots, x_{j-1}, 1, x_{j+1}, \ldots, x_n)$ for all values of the remaining variables;
that is, increasing any variable from 0 to 1 does not cause the function to decrease its
value from 1 to 0.

a) Show that every circuit over the basis $\Omega_{mon}$ = {AND, OR} computes monotone
functions at every gate.

b) Show that every monotone function $f^{(n)} : B^n \mapsto B$ can be expanded as follows:

$$f(x_1, x_2, \ldots, x_n) = x_1 f(1, x_2, \ldots, x_n) \vee f(0, x_2, \ldots, x_n)$$

Show that this implies that every monotone function can be realized by a logic circuit
over the monotone basis $\Omega_{mon}$ = {AND, OR}.

SPECIALIZED FUNCTIONS 

2.14 Complete the proof of Lemma 2.5.3 by solving the recurrences stated in Equation (2.4). 

2.15 Design a multiplexer circuit of circuit size $2^{n+1}$ plus lower-order terms when n is even.
Hint: Construct a smaller circuit by applying the decomposition given in Section 2.5.4 
of the minterms of n variables into minterms on the two halves of the n variables. 

2.16 Complete the proof of Lemma 2.11.1 by establishing the correctness of the inductive 
hypothesis stated in its proof. 

2.17 The binary sorting function is defined in Section 2.11. Show that it can be realized 
with a circuit whose size is O(n) and depth is O(log n).

Hint: Consider using a circuit for $f_{count}^{(n)}$, a decoder circuit and other circuitry. Is there
a role for a prefix computation in this problem?

LOGICAL FUNCTIONS 

2.18 Let $f_{member}^{(n)} : B^{(n+1)b} \mapsto B$ be defined below.

$$f_{member}^{(n)}(x_1, x_2, \ldots, x_n, y) = \begin{cases} 1 & x_i = y \text{ for some } 1 \le i \le n \\ 0 & \text{otherwise} \end{cases}$$

where $x_i, y \in B^b$ and $x_i = y$ if and only if they agree in each position.

Obtain good upper bounds to $C_\Omega(f_{member}^{(n)})$ and $D_\Omega(f_{member}^{(n)})$ by constructing a
circuit over the basis $\Omega = \{\wedge, \vee, \neg, \oplus\}$.

2.19 Design a circuit to compare two n-bit binary numbers and return the value 1 if the first
is larger than or equal to the second and 0 otherwise.

Hint: Compare each pair of digits of the same significance and generate three out- 
comes, yes, maybe, and no, corresponding to whether the first digit is greater than, 
equal to or less than the second. How can you combine the outputs of such a compar- 
ison circuit to design a circuit for the problem? Does a prefix computation appear in 
your circuit? 




PARALLEL PREFIX 

2.20 a) Let $\odot_{copy} : S^2 \mapsto S$ be the operation

$$a \odot_{copy} b = a$$

Show that $(S, \odot_{copy})$ is a semigroup for S an arbitrary non-empty set.
b) Let · denote string concatenation over the set {0, 1}* of binary strings. Show that
it is associative.

2.21 The segmented prefix computation with the associative operation $\odot$ on a "value"
n-vector x over a set S, given a "flag vector" $\phi$ over B, is defined as follows: the value
of the ith entry $y_i$ of the "result vector" y is $x_i$ if its flag is 1 and otherwise is the
associative combination with $\odot$ of $x_i$ and the entries to its left up to and including the
first occurrence of a 1 in the flag array. The leftmost bit in every flag vector is 1. An
example of a segmented prefix computation is given in Section 2.6.

Assuming that $(S, \odot)$ is a semigroup, a segmented prefix computation over the set
S × B of pairs is a special case of general prefix computation. Consider the operator $\otimes$
on pairs $(x_i, \phi_i)$ of values and flags defined below:

$$(x_1, \phi_1) \otimes (x_2, \phi_2) = \begin{cases} (x_2, 1) & \phi_2 = 1 \\ (x_1 \odot x_2, \phi_1) & \phi_2 = 0 \end{cases}$$

Show that $((S, B), \otimes)$ is a semigroup by proving that (S, B) is closed under the operator
$\otimes$ and that the operator $\otimes$ is associative.

2.22 Construct a logic circuit of size O(n log n) and depth O(log n) that, given a binary
n-tuple x, computes the n-tuple y containing the running sum of the number of 1s in x.

2.23 Given 2n Boolean variables organized as pairs 0a or 1a, design a circuit that moves pairs
of the form 1a to the left and the others to the right without changing their relative
order. Show that the circuit has size O(n log n).

2.24 Linear recurrences play an important role in many problems including the solution 
of a tridiagonal linear system of equations. They are defined over "near-rings," which 
are slightly weaker than rings in not requiring inverses under the addition operation. 
(Rings are defined in Section 6.2.1.) 

A near-ring $(\mathcal{R}, \cdot, +)$ is a set $\mathcal{R}$ together with an associative multiplication operator ·
and an associative and commutative addition operator +. (If + is commutative, then
for all $a, b \in \mathcal{R}$, a + b = b + a.) In addition, · distributes over +; that is, for all
$a, b, c \in \mathcal{R}$, $a \cdot (b + c) = a \cdot b + a \cdot c$.

A first-order linear recurrence of length n is an n-tuple $x = (x_1, x_2, \ldots, x_n)$ of variables
over a near-ring $(\mathcal{R}, \cdot, +)$ that satisfies $x_1 = b_1$ and the following set of identities
for $2 \le j \le n$ defined in terms of elements $\{a_j, b_j \in \mathcal{R} \mid 1 \le j \le n\}$:

$$x_j = a_j \cdot x_{j-1} + b_j$$

Use the ideas of Section 2.7 on carry-lookahead addition to show that Xj can be written 

where the pairs (e,, dj) are the result of a prefix computation. 
ARITHMETIC OPERATIONS

2.25 Design a circuit that finds the most significant non-zero position in an n-bit binary 
number and logically shifts the binary number left so that the non-zero bit is in the most 
significant position. The circuit should produce not only the shifted binary number but 
also a binary representation of the amount of the shift. 

2.26 Consider the function $\pi[j, k] = \pi[j, k-1] \diamond \pi[k, k]$ for $1 \le j < k \le n-1$, where $\diamond$
is defined in Section 2.7.1. Show by induction that the first component of $\pi[j, k]$ is 1
if and only if a carry propagates through the full adder stages numbered j, j+1, ..., k
and its second component is 1 if and only if a carry is generated at one of these stages,
propagates through subsequent stages, and appears as a carry out of the kth stage.

2.27 Give a construction of a circuit for subtracting one n-bit positive binary integer from 
another using the twos-complement operation. Show that the circuit has size O(n)
and depth O(log n).

2.28 Complete the proof of Theorem 2.9.3 outlined in the text. In particular, solve the 
recurrence given in equation (2.10). 

2.29 Show that the depth bound stated in Theorem 2.9.3 can be improved from $O(\log^2 n)$
to O(log n) without affecting the size bound by using carry-save addition to form the
six additions (or subtractions) that are involved at each stage. 

Hint: Observe that each multiplication of (n/2)-bit numbers at the top level is ex- 
panded at the next level as sums of the product of (n/4)-bit numbers and that this type 
of replacement continues until the product is formed of 1-bit numbers. Observe also 
that 2n-bit carry-save adders can be used at the top level but that the smaller carry-save 
adders can be used at successively lower levels. 

2.30 Residue arithmetic can be used to add and subtract integers. Given positive relatively
prime integers $p_1, p_2, \ldots, p_k$ (no common factors), an integer n in the set {0, 1, 2, ...,
N − 1}, $N = p_1 p_2 \cdots p_k$, can be represented by the k-tuple $\vec{n} = (n_1, n_2, \ldots, n_k)$,
where $n_j = n \bmod p_j$. Let n and m be in this set.

a) Show that if $n \ne m$, $\vec{n} \ne \vec{m}$.

b) Form $\vec{n} + \vec{m}$ by adding corresponding jth components modulo $p_j$. Show that
$\vec{n} + \vec{m}$ uniquely represents (n + m) mod N.

c) Form $\vec{n} \times \vec{m}$ by multiplying corresponding jth components of $\vec{n}$ and $\vec{m}$ modulo
$p_j$. Show that $\vec{n} \times \vec{m}$ is the unique representation for (nm) mod N.

2.31 Use the circuit designed in Problem 2.19 to build a circuit that adds two n-bit binary 
numbers modulo an arbitrary third n-bit binary number. You may use known circuits. 

2.32 In prime factorization an integer n is represented as the product of primes. Let p(N)
be the largest prime less than N. Then, $n \in \{2, \ldots, N-1\}$ is represented by the
exponents $(e_2, e_3, \ldots, e_{p(N)})$, where $n = 2^{e_2} 3^{e_3} \cdots p(N)^{e_{p(N)}}$. The representation
for the product of two integers in this system is the sum of the exponents of their
respective prime factors. Show that this leads to a multiplication circuit whose depth
is proportional to log log log N. Determine the size of the circuit using the fact that
there are O(N/log N) primes in the set {2, ..., N − 1}.
2.33 Construct a circuit for the division of two n-bit binary numbers from circuits for the
reciprocal function $f_{recip}^{(n)}$ and the integer multiplication function $f_{mult}^{(n)}$. Determine
the size and depth of this circuit and the accuracy of the result.

2.34 Let $f : B^n \mapsto B^{kn}$ be an integer power of x; that is, $f(x) = x^k$ for some integer k.
Show that such functions contain the shifting function $f_{shift}^{(m)}$ as a subfunction for some
integer m. Determine m dependent on n and k.

2.35 Let $f : B^n \mapsto B^n$ be a fractional power of x of the form $f(x) = \lceil x^{q/2^k} \rceil$, $0 <
q < 2^k < \log_2 n$. Show that this function contains the shifting function $f_{shift}^{(m)}$ as a
subfunction. Find the largest value of m for which this holds.



Chapter Notes 



Logic circuits have a long history. Early in the nineteenth century Babbage designed me- 
chanical computers capable of logic operations. In the twentieth century logic circuits, called 
switching circuits, were constructed of electromechanical relays. The earliest formal analysis of 
logic circuits is attributed to Claude Shannon [306]; he applied Boolean algebra to the analysis 
of logic circuits, the topic of Section 2.2. Reduction between problems, a technique central 
to computer science, is encountered whenever one uses an existing program to solve a new 
problem by pre-processing inputs and post-processing outputs. Reductions also provide a way 
to identify problems with similar complexity, an idea given great importance by the work of 
Cook [74], Karp [159], and Levin [199] on NP-completeness. (See also [335].) This topic is 
explored in depth in Chapter 8. 

The upper bound on the size of ripple adder described in Section 2.7 cannot be improved, 
as shown by Red'kin [276] using the gate elimination method of Section 9.3.2. Prefix compu- 
tations, the subject of Section 2.6, were first used by Ofman [234]. He constructed the adder 
based on carry-lookahead addition described in Section 2.7. Krapchenko [173] and Brent 
[57] developed adders with linear size whose depth is $\lceil \log n \rceil + O(\sqrt{\log n})$, asymptotically
almost as good as the best possible depth bound of $\lceil \log n \rceil$.

Ofman used carry-save addition for fast integer multiplication [234]. Wallace indepen- 
dently discovered carry-save addition and logarithmic depth circuits for addition and multipli- 
cation [356]. The divide-and-conquer integer multiplication algorithm of Section 2.9.2 is due 
to Karatsuba [155]. As mentioned at the end of Section 2.9, Schönhage and Strassen [303]
have designed binary integer multipliers of depth O(log n) whose size is O(n log n log log n).

Sir Isaac Newton around 1665 invented the iterative method bearing his name used in 
Section 2.10 for binary integer division. Our treatment of this idea follows that given by Tate 
[325]. Reif and Tate [278] have shown that binary integer division can be done with circuit 
size O(n log n log log n) and depth O(log n log log n) using circuits whose description is
log-space uniform. Beame, Cook, and Hoover [33] have given an O(log n)-depth circuit for the
reciprocal function, the best possible depth bound up to a constant multiple, but one whose
size is polynomial in n and whose description is not uniform; it requires knowledge of about
$n^2/\log n$ primes.

The key result in Section 2.11 on symmetric functions is due to Muller and Preparata 
[226] . As indicated, it is the basis for showing that every one-output symmetric function can 
be realized by a circuit whose size and depth are linear and logarithmic, respectively. 
Shannon [307] developed lower bounds for two-terminal switching circuits of the type
given in Section 2.12 on circuit size. Muller [224] extended the techniques of Shannon to
derive the lower bounds on circuit size given in Theorem 2.12.1. Shannon and Riordan [281]
developed a lower bound of $\Omega(2^n/\log n)$ on the size of Boolean formulas, circuits in which the
fan-out of each gate is 1. As seen in Chapter 9, such bounds readily translate into lower bounds
on depth of the form given in Theorem 2.12.2. Gaskov, using the Lupanov representation, has
derived a comparable upper bound [110].

The upper bound on circuit size given in Section 2.13 is due to Lupanov [208]. Shannon 
and Riordan [281] show that a lower bound of $\Omega(2^n/\log n)$ must apply to the formula size
(see Definition 9.1.1) of most Boolean functions on n variables. Given the relationship of
Theorem 9.2.2 between formula size and depth, a depth lower bound of n − log log n − O(1)
follows.

Early work on circuits and circuit complexity is surveyed by Paterson [237] and covered in
depth by Savage [287]. More recent coverage of this subject is contained in the survey article
by Boppana and Sipser [50] and books by Wegener [360] and Dunne [92].

CHAPTER 3

Machines with Memory



As we saw in Chapter 1, every finite computational task can be realized by a combinational
circuit. While this is an important concept, it is not very practical; we cannot afford to design 
a special circuit for each computational task. Instead we generally perform computational tasks 
with machines having memory. In a strong sense to be explored in this chapter, the memory of 
such machines allows them to reuse their equivalent circuits to realize functions of high circuit 
complexity. 

In this chapter we examine the deterministic and nondeterministic finite-state machine 
(FSM), the random-access machine (RAM), and the Turing machine. The finite-state machine 
moves from state to state while reading input and producing output. The RAM has a central 
processing unit (CPU) and a random-access memory with the property that each memory 
word can be accessed in one unit of time. Its CPU executes instructions, reading and writing 
data from and to the memory. The Turing machine has a control unit that is a finite-state 
machine and a tape unit with a head that moves from one tape cell to a neighboring one in 
each unit of time. The control unit reads from, writes to, and moves the head of the tape unit. 

We demonstrate through simulation that the RAM and the Turing machine are universal 
in the sense that every finite-state machine can be simulated by the RAM and that it and the 
Turing machine can simulate each other. Since they are equally powerful, either can be used as 
a reference model of computation. 

We also simulate with circuits computations performed by the FSM, RAM, and Turing 
machine. These circuit simulations establish two important results. First, they show that all 
computations are constrained by the available resources, such as space and time. For example,
if a function $f$ is computed in $T$ steps by the RAM with storage capacity $S$ (in bits), then $S$
and $T$ must satisfy the inequality $C_\Omega(f) = O(ST)$, where $C_\Omega(f)$ is the size of the smallest
circuit for $f$ over the complete basis $\Omega$. Any attempt to compute $f$ on the RAM using space
$S$ and time $T$ whose product is too small will fail. Second, an $O(\log ST)$-space, $O(ST)$-time
program exists to write the descriptions of circuits simulating the above machines. This fact
leads to the identification in this chapter of the first examples of P-complete and NP-complete
problems.

Chapter 3 Machines with Memory



Models of Computation 



3.1 Finite-State Machines 

The finite-state machine (FSM) has a set of states, one of which is its initial state. At each unit 
of time an FSM is given a letter from its input alphabet. This causes the machine to move
from its current state to a potentially new state. While in a state, the FSM produces a letter 
from its output alphabet. Such a machine computes the function defined by the mapping 
from its initial state and strings of input letters to strings of output letters. FSMs can also be 
used to accept strings, as discussed in Chapter 4. Some states are called final states. A string 
is recognized (or accepted) by an FSM if the last state entered by the machine on that input 
string is a final state. The language recognized (or accepted) by an FSM is the set of strings 
accepted by it. We now give a formal definition of an FSM. 

DEFINITION 3.1.1 A finite-state machine (FSM) $M$ is a seven-tuple $M = (\Sigma, \Psi, Q, \delta, \lambda, s, F)$,
where $\Sigma$ is the input alphabet, $\Psi$ is the output alphabet, $Q$ is the finite set of states,
$\delta : Q \times \Sigma \mapsto Q$ is the next-state function, $\lambda : Q \mapsto \Psi$ is the output function, $s$ is the initial
state (which may be fixed or variable), and $F$ is the set of final states ($F \subseteq Q$). If the FSM is
given input letter $a$ when in state $q$, it enters state $\delta(q, a)$. While in state $q$ it produces the output
letter $\lambda(q)$.

The FSM $M$ accepts the string $w \in \Sigma^*$ if the last state entered by $M$ on the input string $w$
starting in state $s$ is in the set $F$. $M$ recognizes (or accepts) the language $L$ consisting of the set
of such strings.

When the initial state of the FSM $M$ is not fixed, for each integer $T$, $M$ maps the initial state
$s$ and its $T$ external inputs $w_1, w_2, \ldots, w_T$ onto its $T$ external outputs $y_1, y_2, \ldots, y_T$ and the
final state $q^{(T)}$. We say that in $T$ steps the FSM $M$ computes the function
$f_M^{(T)} : Q \times \Sigma^T \mapsto Q \times \Psi^T$. It is assumed that the sets $\Sigma$, $\Psi$, and $Q$ are encoded in binary so
that $f_M^{(T)}$ is a binary function.

The next-state and output functions of an FSM, $\delta$ and $\lambda$, can be represented as in Fig. 3.1.
We visualize these functions taking a state value from a memory and an input value from an
external input and producing next-state and output values. Next-state values are stored in the
memory and output values are released to the external world. From this representation an
actual machine (a sequential circuit) can be constructed (see Section 3.3). Once circuits are
constructed for $\delta$ and $\lambda$, we need only add memory units and a clock to construct a sequential
circuit that emulates an FSM.



Figure 3.1 The finite-state machine model. (Labels in the figure: Input, Output.)

Figure 3.2 A finite-state machine computing the EXCLUSIVE OR of its inputs.



An example of an FSM is shown in Fig. 3.2. Its input and output alphabets and state
sets are $\Sigma = \{0, 1\}$, $\Psi = \{0, 1\}$, and $Q = \{q_0, q_1\}$, respectively. Its next-state and output
functions, $\delta$ and $\lambda$, are given below.

The FSM has initial state $q_0$ and final state $q_1$. As a convenience we explicitly identify final
states by shading, although in practice they can be associated with states producing a particular
output letter.

Each state has a label $q_j/v_j$, where $q_j$ is the name of the state and $v_j$ is the output produced
while in this state. The initial state has an arrow labeled with the word "start" pointing to
it. Clearly, the strings accepted by this FSM are those containing an odd number of
instances of 1. Thus it computes the EXCLUSIVE OR function on an arbitrary number of
inputs.
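
The machine of Fig. 3.2 can be sketched in a few lines of Python. The transition table below is an assumption inferred from the description above (input 1 toggles the state, input 0 leaves it unchanged, $q_0$ outputs 0 and $q_1$ outputs 1); the names DELTA, LAMBDA, accepts, and xor_of are illustrative only.

```python
# Sketch of the EXCLUSIVE OR machine of Fig. 3.2 as a Moore machine.
# Assumption: the transition table is inferred from the prose description.
DELTA = {("q0", 0): "q0", ("q0", 1): "q1",
         ("q1", 0): "q1", ("q1", 1): "q0"}   # next-state function
LAMBDA = {"q0": 0, "q1": 1}                  # output function
START, FINAL = "q0", {"q1"}

def last_state(bits):
    """Return the last state entered when the machine reads `bits`."""
    state = START
    for b in bits:
        state = DELTA[(state, b)]
    return state

def accepts(bits):
    """A string is accepted if the last state entered is a final state."""
    return last_state(bits) in FINAL

def xor_of(bits):
    """Output letter produced in the last state entered (the parity of bits)."""
    return LAMBDA[last_state(bits)]
```

Under these assumptions, accepts([1, 0, 1, 1]) holds because the string contains three 1s, while accepts([1, 1]) does not.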

While it is conventional to think of the finite-state machine as a severely restricted com- 
putational model, it is actually a very powerful one. The random-access machine (RAM) 
described in Section 3.4 is an FSM when the number of memory locations that it contains 
is bounded, as is always so in practice. When a program is first placed in the memory of 
the RAM, the program sets the initial state of the RAM. The RAM, which may or may not 
read external inputs or produce external outputs, generally will leave its result in its memory; 
that is, the result of the computation often determines the final state of the random-access 
machine. 

The FSM defined above is called a Moore machine because it was defined by E.F. Moore
[223] in 1956. An alternative FSM, the Mealy machine (defined by Mealy [215] in 1955),
has an output function $\lambda^* : Q \times \Sigma \mapsto \Psi$ that generates an output on each transition from
one state to another. This output is determined by both the state in which the machine resides
before the state transition and the input letter causing the transition. It can be shown that the
two machine models are equivalent (see Problem 3.6): any computation one can do, the other
can do also.

3.1.1 Functions Computed by FSMs

We now examine the ways in which an FSM might compute a function. Since our goal is to 
understand the power and limits of computation, we must be careful not to assume that an 
FSM can have hidden access to an external computing device. All computing devices must 
be explicit. It follows that we allow FSMs only to compute functions that receive inputs and 
produce outputs at data-independent times. 



To understand the function computed by an FSM $M$, observe that in initial state $q^{(0)}$
and receiving input letter $w_1$, $M$ enters state $q^{(1)} = \delta(q^{(0)}, w_1)$ and produces output
$y_1 = \lambda(q^{(1)})$. If $M$ then receives input $w_2$, it enters state $q^{(2)} = \delta(q^{(1)}, w_2)$ and produces output
$y_2 = \lambda(q^{(2)})$. Repeated applications of the functions $\delta$ and $\lambda$ on successive states with
successive inputs, as suggested by Fig. 3.3, generate the outputs $y_1, y_2, \ldots, y_T$ and the final state
$q^{(T)}$. The function $f_M^{(T)} : Q \times \Sigma^T \mapsto Q \times \Psi^T$ given in Definition 3.1.1 defines this mapping
from an initial state and inputs to the final state and outputs:

$$f_M^{(T)}\left(q^{(0)}, w_1, w_2, \ldots, w_T\right) = \left(q^{(T)}, y_1, y_2, \ldots, y_T\right)$$

This simulation of a machine with memory by a circuit illustrates a fundamental point about 
computation, namely, that the role of memory is to hold intermediate results on which the 
logical circuitry of the machine can operate in successive cycles. 
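
The unrolling just described can be written as a short loop. This is a minimal sketch, not the book's construction: the function name f_M and the one-bit example machine (next state is the XOR of state and input, output is the state) are assumptions made for illustration.

```python
def f_M(delta, lam, q0, inputs):
    """Compute f_M^(T)(q0, w_1, ..., w_T) = (q^(T), y_1, ..., y_T)."""
    state, outputs = q0, []
    for w in inputs:                 # one copy of the (delta, lam) logic per step
        state = delta(state, w)      # q^(j) = delta(q^(j-1), w_j)
        outputs.append(lam(state))   # y_j = lam(q^(j))
    return state, tuple(outputs)

# Illustrative machine (an assumption): one-bit state, delta = XOR, lam = identity.
final_state, ys = f_M(lambda q, w: q ^ w, lambda q: q, 0, [1, 0, 1, 1])
```

Each pass through the loop body plays the role of one stage of the circuit of Fig. 3.3; the variable state is the only information carried from one stage to the next.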

When an FSM $M$ is used in a $T$-step computation, it usually does not compute the most
general function $f_M^{(T)}$ that it can. Instead, some restrictions are generally placed on the possible
initial states, on the values of the external inputs provided to $M$, and on the components of
the final state and output letters used in the computation. Consider three examples of the
specialization of an FSM to a particular task. In the first, let the FSM model be that shown in
Fig. 3.2 and let it be used to form the EXCLUSIVE OR of $n$ variables. In this case, we supply $n$
bits to the FSM but ignore all but the last output value it produces. In the second example, let
the FSM be a programmable machine in which a program is loaded into its memory before the
start of a computation, thereby setting its initial state. The program ignores all external inputs
and produces no output, leaving the value of the function in memory. In the third example,
again let the FSM be programmable, but let the program initially residing in its
memory be a "boot program" that treats its inputs as program statements. (Thus, the FSM
has a fixed initial state.) The boot program forms a program by loading these statements into
successive memory locations. It then jumps to the first location in this program.

In each of these examples, the function $f$ that is actually computed by $M$ in $T$ steps is
a subfunction of the function $f_M^{(T)}$ because $f$ is obtained by either restricting the values of
the initial state and inputs to $M$ or deleting outputs or both. We assume that every function
computed by $M$ in $T$ steps is a subfunction $f$ of the function $f_M^{(T)}$.

Figure 3.3 A circuit computing the same function, $f_M^{(T)}$, as a finite-state machine $M$ in $T$
steps.

The simple construction of Fig. 3.3 is the first step in deriving a space-time product in- 
equality for the random-access machine in Section 3.5 and in establishing a connection be- 
tween Turing time and circuit complexity in Section 3.9.2. It is also involved in the definition 
of the P-complete and NP-complete problems in Section 3.9.4. 

3.1.2 Computational Inequalities for the FSM 

In this book we model each computational task by a function that, we assume without loss
of generality, is binary. We also assume that the function $f_M^{(T)} : Q \times \Sigma^T \mapsto Q \times \Psi^T$
computed in $T$ steps by an FSM $M$ is binary. In particular, we assume that the next-state
and output functions, $\delta$ and $\lambda$, are also binary; that is, we assume that their input, state, and
output alphabets are encoded in binary. We now derive some consequences of the fact that a
computation by an FSM can be simulated by a circuit.

The size $C_\Omega\left(f_M^{(T)}\right)$ of the smallest circuit to compute the function $f_M^{(T)}$ is no larger than
the size of the circuit shown in Fig. 3.3. But this circuit has size $T \cdot C_\Omega(\delta, \lambda)$, where $C_\Omega(\delta, \lambda)$
is the size of the smallest circuit to compute the functions $\delta$ and $\lambda$. The depth of the shallowest
circuit for $f_M^{(T)}$ is no more than $T \cdot D_\Omega(\delta, \lambda)$ because the longest path through the circuit of
Fig. 3.3 has this length.

Let $f$ be the function computed by $M$ in $T$ steps. Since it is a subfunction of $f_M^{(T)}$,
it follows from Lemma 2.4.1 that the size of the smallest circuit for $f$ is no larger than the
size of the circuit for $f_M^{(T)}$. Similarly, the depth of $f$, $D_\Omega(f)$, is no more than that of $f_M^{(T)}$.
Combining the observations of this paragraph with those of the preceding paragraph yields the
following computational inequalities. A computational inequality is an inequality relating
parameters of computation, such as time and the circuit size and depth of the next-state and
output function, to the size or depth of the smallest circuit for the function being computed.

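THEOREM 3.1.1 Let $f_M^{(T)}$ be the function computed by the FSM $M = (\Sigma, \Psi, Q, \delta, \lambda, s, F)$ in
$T$ steps, where $\delta$ and $\lambda$ are the binary next-state and output functions of $M$. The circuit size and
depth over the basis $\Omega$ of any function $f$ computed by $M$ in $T$ steps satisfy the following inequalities:

$$C_\Omega(f) \le C_\Omega\left(f_M^{(T)}\right) \le T\, C_\Omega(\delta, \lambda)$$
$$D_\Omega(f) \le D_\Omega\left(f_M^{(T)}\right) \le T\, D_\Omega(\delta, \lambda)$$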



The circuit size $C_\Omega(\delta, \lambda)$ and depth $D_\Omega(\delta, \lambda)$ of the next-state and output functions of an
FSM $M$ are measures of its complexity, that is, of how useful they are in computing functions.
The above theorem, which says nothing about the actual technologies used to realize $M$, relates
these two measures of the complexity of $M$ to the complexities of the function $f$ being
computed. This is a theorem about computational complexity, not technology.

These inequalities stipulate constraints that must hold between the time $T$ and the circuit
size and depth of the machine $M$ if it is used to compute the function $f$ in $T$ steps. Let the
product $T C_\Omega(\delta, \lambda)$ be defined as the equivalent number of logic operations performed by
$M$. The first inequality of the above theorem can be interpreted as saying that the number of
equivalent logic operations performed by an FSM to compute a function $f$ must be at least
the minimum number of gates necessary to compute $f$ with a circuit. A similar interpretation
can be given to the second inequality involving circuit depth.
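
As a small worked instance of the first inequality, consider the machine of Fig. 3.2 under two assumptions that are not stated in the text: its state is encoded in one bit, and the basis $\Omega$ contains the two-input $\oplus$ gate. Then $\delta(q, w) = q \oplus w$ and $\lambda(q) = q$, so a single gate realizes both functions and $C_\Omega(\delta, \lambda) = 1$. Computing the EXCLUSIVE OR $f$ of $n \ge 2$ variables takes $T = n$ steps, giving

$$C_\Omega(f) \le T \cdot C_\Omega(\delta, \lambda) = n \cdot 1 = n.$$

A circuit of two-input gates for this function needs at least $n - 1$ gates, because $f$ depends on all $n$ variables, and $n - 1$ gates suffice. Under these assumptions the bound is therefore within one gate of the true circuit size.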

The first inequality of Theorem 3.1.1 and the interpretation given to $T \cdot C_\Omega(\delta, \lambda)$ justify
the following definitions of computational work and power. Here power is interpreted as
the time rate at which work is done. These measures correlate nicely with our intuition that
machines that contain more equivalent computing elements are more powerful.

DEFINITION 3.1.2 The computational work done by an FSM $M = (\Sigma, \Psi, Q, \delta, \lambda, s, F)$ is
$T C_\Omega(\delta, \lambda)$, the number of equivalent logical operations performed by $M$, which is the product of
$T$, the number of steps executed by $M$, and $C_\Omega(\delta, \lambda)$, the size complexity of its next-state and output
functions. The power of an FSM $M$ is $C_\Omega(\delta, \lambda)$, the number of logical operations performed by
$M$ per step.

Theorem 3.1.1 is also a form of impossibility theorem: it is impossible to compute functions
$f$ for which $T C_\Omega(\delta, \lambda)$ and $T D_\Omega(\delta, \lambda)$ are respectively less than the size and depth
complexity of $f$. It may be possible to compute a function on some points of its domain
with smaller values of these parameters, but not on all points. The halting problem, another
example of an impossibility theorem, is presented in Section 5.8.2. However, it deals with the
computation of functions over infinite domains.

The inequalities of Theorem 3.1.1 also place upper limits on the size and depth complex- 
ities of functions that can be computed in a bounded number of steps by an FSM, regardless 
of how the FSM performs the computation. 

Note that there is no guarantee that the upper bounds stated in Theorem 3.1.1 are at all 
close to the lower bounds. It is always possible to compute a function inefficiently, that is, with 
resources that are greater than the minimal resources necessary. 

3.1.3 Circuits Are Universal for Bounded FSM Computations 

We now ask whether the classes of functions computed by circuits and by FSMs executing
a bounded number of steps are different. We show that they are the same. Many different
functions can be computed from the function $f_M^{(T)}$ by specializing inputs and/or deleting
outputs.

THEOREM 3.1.2 Every subfunction of the function $f_M^{(T)}$ computable by an FSM on $n$ inputs is
computable by a Boolean circuit and vice versa.

Proof A Boolean function on $n$ inputs, $f$, may be computed by an FSM with $2^{n+1} - 1$
states by branching from the current state to one of two different states on inputs 0 and 1
until all $n$ inputs have been read; it then produces the output that would be produced by $f$
on these $n$ inputs. A fifteen-state version of this machine that computes the EXCLUSIVE OR
on three inputs as a subfunction is shown in Fig. 3.4.

The proof in the other direction is also straightforward, as described above and represented
schematically in Fig. 3.3. Given a binary representation of the input, output, and state
symbols of an FSM, their associated next-state and output functions are binary functions.
They can be realized by circuits, as can $f_M^{(n)}(s, w) = (q^{(n)}, y)$, the function computed by
the FSM on $n$ inputs, as suggested by Fig. 3.3. Finally, the subfunction $f$ is obtained by
fixing the appropriate inputs, assigning variable names to the remaining inputs, and deleting
the appropriate outputs. ■

Figure 3.4 A fifteen-state FSM that computes the EXCLUSIVE OR of three inputs as a subfunction
of $f_M^{(3)}$ obtained by deleting all outputs except the third.
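
The branching machine used in the first half of the proof can be sketched as follows. The representation is an assumption chosen for illustration: each state is the tuple of input bits read so far, the root is the empty tuple, and interior states output 0 because their outputs are deleted anyway. The names tree_fsm, run, and xor3 are hypothetical.

```python
from itertools import product

def tree_fsm(f, n):
    """Build the branching FSM of the proof: one state per binary string of
    length 0..n. Returns (states, delta, lam)."""
    states = [bits for k in range(n + 1) for bits in product((0, 1), repeat=k)]
    # From a state of length < n, branch on the input letter.
    delta = {(q, a): q + (a,) for q in states if len(q) < n for a in (0, 1)}
    # Leaves output f; the output of interior states is unused (set to 0 here).
    lam = {q: (f(q) if len(q) == n else 0) for q in states}
    return states, delta, lam

def run(delta, lam, inputs):
    """Feed the inputs from the root and return the output of the state reached."""
    q = ()                      # initial state: the empty string (root)
    for a in inputs:
        q = delta[(q, a)]
    return lam[q]

xor3 = lambda bits: bits[0] ^ bits[1] ^ bits[2]
states, delta, lam = tree_fsm(xor3, 3)
```

For $n = 3$ the list states has $1 + 2 + 4 + 8 = 15$ entries, matching the $2^{n+1} - 1$ count in the proof.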



3.1.4 Interconnections of Finite-State Machines 

Later in this chapter we examine a family of FSMs characterized by a computational unit 
connected to storage devices of increasing size. The random-access machine that has a CPU 
of small complexity and a random-access memory of large but indeterminate size is of this 
type. The Turing machine having a fixed control unit that moves a tape head over a potentially 
infinite tape is another example. 

This idea is captured by the interconnection of synchronous FSMs. Synchronous FSMs 
read inputs, advance from state to state, and produce outputs in synchronism. We allow two 
or more synchronous FSMs to be interconnected so that some outputs from one FSM are 
supplied as inputs of another, as illustrated in Fig. 3.5. Below we generalize Theorem 3.1.1 to 
a pair of synchronous FSMs. We model random-access machines and Turing machines in this 
fashion when each uses a finite amount of storage. 

THEOREM 3.1.3 Let $f^{(T)}_{M_1 \times M_2}$ be a function computed in $T$ steps by a pair of interconnected
synchronous FSMs, $M_1 = (\Sigma_1, \Psi_1, Q_1, \delta_1, \lambda_1, s_1, F_1)$ and $M_2 = (\Sigma_2, \Psi_2, Q_2, \delta_2, \lambda_2, s_2, F_2)$.
Let $C_\Omega(\delta, \lambda)$ and $D_\Omega(\delta, \lambda)$ be the size and depth of encodings of the next-state and output
functions. Then, the circuit size and depth over the basis $\Omega$ of any function $f$ computed by the pair
$M_1 \times M_2$ in $T$ steps (that is, a subfunction of $f^{(T)}_{M_1 \times M_2}$) satisfy the following inequalities:

$$C_\Omega(f) \le T\left[C_\Omega(\delta_1, \lambda_1) + C_\Omega(\delta_2, \lambda_2)\right]$$
$$D_\Omega(f) \le T\left[\max\left(D_\Omega(\delta_1, \lambda_1), D_\Omega(\delta_2, \lambda_2)\right)\right]$$

Proof The construction that leads to this result is suggested by Fig. 3.6. We unwind both
FSMs and connect the appropriate outputs from one to the other to produce a circuit that
computes $f^{(T)}_{M_1 \times M_2}$. Observe that the number of gates in the simulated circuit is $T$ times the
sum of the numbers of gates in the circuits for the two machines, whereas the depth is $T$ times
the depth of the deeper circuit. ■

Figure 3.5 The interconnection of two finite-state machines in which one of the three outputs
of $M_1$ is supplied as an input to $M_2$ and two of the three outputs of $M_2$ are supplied to $M_1$ as
inputs.

Figure 3.6 A circuit simulating $T$ steps of the two synchronous interconnected FSMs shown
in Fig. 3.5. The top row of circuits simulates a $T$-step computation by $M_1$ and the bottom row
simulates a $T$-step computation by $M_2$. One of the three outputs of $M_1$ is supplied as an input
to $M_2$ and two of the three outputs of $M_2$ are supplied to $M_1$ as inputs. The states of $M_1$ on the
initial and $T$ successive steps are $q_0, q_1, \ldots, q_T$. Those of $M_2$ are $p_0, p_1, \ldots, p_T$.



3.1.5 Nondeterministic Finite-State Machines 

The finite-state machine model described above is called a deterministic FSM (DFSM) be- 
cause, given a current state and an input, the next state of the FSM is uniquely determined. 
A potentially more general FSM model is the nondeterministic FSM (NFSM) characterized 
by the possibility that several next states can be reached from the current state for some given 
input letter. 

One might ask if such a model has any use, especially since to the untrained eye a
nondeterministic machine would appear to be a dysfunctional deterministic one. The value of an
NFSM is that it may recognize languages with fewer states and in less time than needed by a
DFSM. The concept of nondeterminism will be extended later to the Turing machine, where
it is used to classify languages in terms of the time and space they need for recognition. For
example, it will be used to identify the class NP of languages that are recognized by
nondeterministic Turing machines in a number of steps that is polynomial in the length of their inputs.
(See Section 3.9.6.) Many important combinatorial problems, such as the traveling salesperson
problem, fall into this class.

The formal definition of the NFSM is given in Section 4.1, where the next-state function
$\delta : Q \times \Sigma \mapsto Q$ of the FSM is replaced by a next-state function $\delta : Q \times \Sigma \mapsto 2^Q$. Such
functions assign to each state $q$ and input letter $a$ a subset $\delta(q, a)$ of the set $Q$ of states of the
NFSM. ($2^Q$, the power set, is the set of all subsets of $Q$. It is introduced in Section 1.2.1.)
Since the value of $\delta(q, a)$ can be the empty set, there may be no successor to the state $q$ on
input $a$. Also, since $\delta(q, a)$ when viewed as a set can contain more than one element, a state
$q$ can have edges labeled $a$ to several other states. Since a DFSM has a single successor to each
state on every input, a DFSM is an NFSM in which $\delta(q, a)$ is a singleton set.

While a DFSM M accepts a string w if w causes M to move from the initial state to a 
final state in F, an NFSM accepts w if there is some set of next-state choices for w that causes 
M to move from the initial state to a final state in F. 

An NFSM can be viewed as a purely deterministic finite-state machine that has two inputs, 
as suggested in Fig. 3.7. The first, the standard input, a, accepts the user's data. The second, 
the choice input, c, is used to choose a successor state when there is more than one. The in- 
formation provided via the choice input is not under the control of the user supplying data via 
the standard input. As a consequence, the machine is nondeterministic from the point of view 
of the user but fully deterministic to an outside observer. It is assumed that the choice agent 
supplies the choice input and, with full knowledge of the input to be provided by the user, 
chooses state transitions that, if possible, lead to acceptance of the user input. On the other 
hand, the choice agent cannot force the machine to accept inputs for which it is not designed. 

In an NFSM it is not required that a state $q$ have a successor for each value of the standard
and choice inputs. This possibility is captured by allowing $\delta(q, a, c)$ to have no value, denoted
by $\delta(q, a, c) = \bot$.

Figure 3.7 A nondeterministic finite-state machine modeled as a deterministic one that has a
second choice input whose value disambiguates the value of the next state.

Figure 3.8 shows an NFSM that recognizes strings over $\mathcal{B}^*$ that end in 00101. In this
figure parentheses surround the choice input when its value is needed to decide the next state.
In this machine the choice input is set to 1 when the choice agent knows that the user is about
to supply the suffix 00101.
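
An NFSM accepting the binary strings that end in 00101 can be simulated without modeling the choice agent at all, by tracking every state reachable under some sequence of choices. This is a swapped-in technique rather than the choice-input view of the text, and the state numbering below (state k means that the first k letters of the suffix have been matched) is an assumption; it is not read off Fig. 3.8.

```python
# Assumed encoding of an NFSM for binary strings ending in 00101:
# state k means "the first k letters of the suffix 00101 have been matched".
SUFFIX = "00101"
FINAL = len(SUFFIX)

def successors(q, a):
    """Return the set delta(q, a); it may be empty or hold several states."""
    nxt = set()
    if q == 0:
        nxt.add(0)                          # stay: suffix not started yet
    if q < FINAL and SUFFIX[q] == a:
        nxt.add(q + 1)                      # advance along the suffix
    return nxt

def accepts(w):
    """Accept if SOME sequence of next-state choices ends in the final state."""
    current = {0}
    for a in w:
        current = set().union(*(successors(q, a) for q in current))
    return FINAL in current
```

State 0 has two successors on input 0, which is where the nondeterminism lies; the final state has no successors, so a string with further letters after 00101 is accepted only if it ends in 00101 again.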



Figure 3.8 A nondeterministic FSM that accepts binary strings ending in 00101. Choice
inputs are shown in parentheses for those user inputs for which the value of choice inputs can
disambiguate next-state moves.

Figure 3.9 An example of an NFSM whose choice agent (its values are in parentheses) accepts
not only strings in a language L, but all strings.



Although we use the anthropomorphic phrase "choice agent," it is important to note that 
this choice agent cannot freely decide which strings to accept and which not. Instead, it must 
when possible make choices leading to acceptance. Consider, for example, the machine in 
Fig. 3.9. It would appear that its choice agent can accept strings in an arbitrary language L. In 
fact, the language that it accepts contains all strings. 

Given a string w in the language L accepted by an NFSM, a choice string that leads to its 
acceptance is said to be a succinct certificate for its membership in L. 

It is important to note that the nondeterministic finite-state machine is not a model of 
reality, but is used instead primarily to classify languages. In Section 4.1 we explore the 
language-recognition capability of the deterministic and nondeterministic finite-state machines 
and show that they are the same. However, the situation is not so clear with regard to Turing 
machines that have access to unlimited storage capacity. In this case, we do not know whether 
or not the set of languages accepted in polynomial time on deterministic Turing machines (the 
class P) is the same set of languages that is accepted in polynomial time by nondeterministic 
Turing machines (the class NP).

3.2 Simulating FSMs with Shallow Circuits* 

In Section 3.1 we demonstrated that every $T$-step FSM computation can be simulated by
a circuit whose size and depth are both $O(T)$. In this section we show that every $T$-step
finite-state machine computation can be simulated by a circuit whose size and depth are $O(T)$
and $O(\log T)$, respectively. While this seems a serious improvement in the depth bound, the
coefficients hidden in the big-$O$ notation for both bounds depend on the number of states of
the FSM and can be very large. Nevertheless, for simple problems, such as binary addition, the
results of this section can be useful. We illustrate this here for binary addition by exhibiting
small and shallow circuits for the adder FSM of Fig. 3.10. The circuit simulation for this
FSM produces the carry-lookahead adder circuit of Section 2.7. In this section we use matrix
multiplication, which is covered in Chapter 6.

Figure 3.10 A finite-state machine that adds two binary numbers. Their two least significant
bits are supplied first followed by those of increasing significance. The output bits represent the
sum of the two numbers.

The new method is based on the representation of the function $f_M^{(T)} : Q \times \Sigma^T \mapsto Q \times \Psi^T$
computed in $T$ steps by an FSM $M = (\Sigma, \Psi, Q, \delta, \lambda, s, F)$ in terms of the set of state-to-state
mappings $S = \{h : Q \mapsto Q\}$ where $S$ contains the mappings $\{\Delta_x : Q \mapsto Q \mid x \in \Sigma\}$
and $\Delta_x$ is defined below.

$$\Delta_x(q) = \delta(q, x) \tag{3.1}$$

That is, $\Delta_x(q)$ is the state to which state $q$ is carried by the input letter $x$.

The FSM shown in Fig. 3.10 adds two binary numbers sequentially by simulating a ripple
adder. (See Section 2.7.) Its input alphabet is $\mathcal{B}^2$, that is, the set of pairs of 0's and 1's. Its
output alphabet is $\mathcal{B}$ and its state set is $Q = \{q_0, q_1, q_2, q_3\}$. (A sequential circuit for this
machine is designed in Section 3.3.) It has the state-to-state mappings shown in Fig. 3.11.

Let $\odot : S^2 \mapsto S$ be the operator defined on the set $S$ of state-to-state mappings where for
arbitrary $h_1, h_2 \in S$ and state $q \in Q$ the operator $\odot$ is defined as follows:

$$(h_1 \odot h_2)(q) = h_2(h_1(q)) \tag{3.2}$$

The state-to-state mappings in $S$ will be obtained by composing the mappings
$\{\Delta_x : Q \mapsto Q \mid x \in \Sigma\}$ using this operator.

Figure 3.11 The state-to-state mappings associated with the FSM of Fig. 3.10.

Below we show that the operator $\odot$ is associative, that is, $\odot$ satisfies the property
$(h_1 \odot h_2) \odot h_3 = h_1 \odot (h_2 \odot h_3)$. This means that for each $q \in Q$,
$((h_1 \odot h_2) \odot h_3)(q) = (h_1 \odot (h_2 \odot h_3))(q) = h_3(h_2(h_1(q)))$. Applying the definition
of $\odot$ in Equation (3.2), we have the following for each $q \in Q$:

$$\begin{aligned}
((h_1 \odot h_2) \odot h_3)(q) &= h_3((h_1 \odot h_2)(q)) \\
&= h_3(h_2(h_1(q))) \\
&= (h_2 \odot h_3)(h_1(q)) \\
&= (h_1 \odot (h_2 \odot h_3))(q)
\end{aligned} \tag{3.3}$$

Thus, $\odot$ is associative and $(S, \odot)$ is a semigroup. (See Section 2.6.) It follows that a prefix
computation can be done on a sequence of state-to-state mappings.
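
The associativity argument can also be checked exhaustively on a small state set. The sketch below stores a mapping $h : Q \mapsto Q$ as a tuple whose entry at index $q$ is the image of state $q$; this representation and the names compose and S are assumptions made for illustration.

```python
from itertools import product

def compose(h1, h2):
    """(h1 ⊙ h2)(q) = h2(h1(q)); a mapping is a tuple h with h[q] the image of q."""
    return tuple(h2[h1[q]] for q in range(len(h1)))

# All 27 state-to-state mappings on a three-state set Q = {0, 1, 2}.
S = list(product(range(3), repeat=3))

# Check (h1 ⊙ h2) ⊙ h3 == h1 ⊙ (h2 ⊙ h3) for every triple of mappings.
associative = all(
    compose(compose(h1, h2), h3) == compose(h1, compose(h2, h3))
    for h1, h2, h3 in product(S, repeat=3)
)
```

The operator is associative but not commutative in general, which is why the order of the inputs must be preserved in the prefix computation.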

We now use this observation to construct a shallow circuit for the function $f_M^{(T)}$. Let
$w = (w_1, w_2, \ldots, w_T)$ be a sequence of $T$ inputs to $M$ where $w_j$ is supplied on the $j$th step. Let
$q^{(j)}$ be the state of $M$ after receiving the $j$th input. From the definition of $\odot$ it follows that
$q^{(j)}$ has the following value where $s$ is the initial state of $M$:

$$q^{(j)} = (\Delta_{w_1} \odot \Delta_{w_2} \odot \cdots \odot \Delta_{w_j})(s)$$

The value of $f_M^{(T)}$ on initial state $s$ and $T$ inputs can be represented in terms of
$q = (q^{(1)}, \ldots, q^{(T)})$ as follows:

$$f_M^{(T)}(s, w) = \left(q^{(T)}, \lambda(q^{(1)}), \lambda(q^{(2)}), \ldots, \lambda(q^{(T)})\right)$$

Let $\Delta^{(T)}$ be the following sequence of state-to-state mappings:

$$\Delta^{(T)} = (\Delta_{w_1}, \Delta_{w_2}, \ldots, \Delta_{w_T})$$

It follows that $q$ can be obtained by computing the state-to-state mappings
$\Delta_{w_1} \odot \Delta_{w_2} \odot \cdots \odot \Delta_{w_j}$, $1 \le j \le T$, and applying them to the initial state $s$. Because $\odot$ is
associative, these $T$ state-to-state mappings are produced by the prefix operator $\mathcal{P}_\odot^{(T)}$ on the
sequence $\Delta^{(T)}$ (see Theorem 2.6.1):

$$\mathcal{P}_\odot^{(T)}(\Delta^{(T)}) = \left(\Delta_{w_1}, (\Delta_{w_1} \odot \Delta_{w_2}), \ldots, (\Delta_{w_1} \odot \Delta_{w_2} \odot \cdots \odot \Delta_{w_T})\right)$$

Restating Theorem 2.6.1 for this problem, we have the following result.

THEOREM 3.2.1 For $T = 2^k$, $k$ an integer, the $T$ state-to-state mappings defined by the $T$ inputs
to an FSM $M$ can be computed by a circuit over the basis $\Omega = \{\odot\}$ whose size and depth satisfy
the following bounds:

$$C_\Omega\left(\mathcal{P}_\odot^{(T)}\right) \le 2T - \log_2 T$$
$$D_\Omega\left(\mathcal{P}_\odot^{(T)}\right) \le 2 \log_2 T$$

The construction of a shallow Boolean circuit for $f_M^{(T)}$ is reduced to a five-step problem: 1)
for each input letter $x$ design a circuit whose input and output are representations of states and
which defines the state-to-state mapping $\Delta_x$ for input letter $x$; 2) construct a circuit for the
associative operator $\odot$ that accepts the representations of two state-to-state mappings $\Delta_y$ and
$\Delta_z$ and produces a representation for the state-to-state mapping $\Delta_y \odot \Delta_z$; 3) use the circuit
for $\odot$ in a parallel prefix circuit to produce the $T$ state-to-state mappings; 4) construct a circuit
that combines the representation of the initial state $s$ with that of the state-to-state mapping
$\Delta_{w_1} \odot \Delta_{w_2} \odot \cdots \odot \Delta_{w_j}$ to obtain a representation for the successor state
$(\Delta_{w_1} \odot \Delta_{w_2} \odot \cdots \odot \Delta_{w_j})(s)$; and 5) construct a circuit for $\lambda$ that computes an output from
the representation of a state.
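
The five steps can be traced in software for the adder of Fig. 3.10. The sketch below is illustrative only. It assumes the state encoding $q = 2 \cdot \text{carry} + \text{sum bit}$, and it computes the prefixes by plain divide and conquer, which exposes the independence of the two halves but is not the size-optimal parallel prefix circuit of Theorem 2.6.1. The names delta_map, prefixes, and add_bits are hypothetical.

```python
def compose(h1, h2):
    """Step 2: (h1 ⊙ h2)(q) = h2(h1(q)) on mappings stored as tuples."""
    return tuple(h2[h1[q]] for q in range(len(h1)))

def prefixes(maps):
    """Step 3: all prefix compositions, by divide and conquer.
    The two halves are independent, which is what a parallel circuit exploits."""
    if len(maps) <= 1:
        return list(maps)
    mid = len(maps) // 2
    left, right = prefixes(maps[:mid]), prefixes(maps[mid:])
    return left + [compose(left[-1], r) for r in right]

def delta_map(u, v):
    """Step 1 (assumed encoding): state q = 2*carry + sum bit, so q // 2 is the carry."""
    return tuple(2 * ((u + v + q // 2) // 2) + (u + v + q // 2) % 2 for q in range(4))

LAMBDA = (0, 1, 0, 1)          # Step 5: the output is the sum bit of the state

def add_bits(pairs, s=0):
    """Return the sum bits, least significant first, for the given bit pairs."""
    states = [h[s] for h in prefixes([delta_map(u, v) for u, v in pairs])]  # Step 4
    return [LAMBDA[q] for q in states]
```

For example, the pairs [(1, 1), (1, 0), (0, 1), (0, 0)] supply 3 and 5 with least significant bits first, and the output bits are those of 8 in the same order.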

We now describe a generic, though not necessarily efficient, implementation of these steps. 

Let $Q = \{q_0, q_1, \ldots, q_{|Q|-1}\}$ be the states of $M$. The state-to-state mapping $\Delta_x$ for the
FSM $M$ needed for the first step can be represented by a $|Q| \times |Q|$ Boolean matrix
$N(x) = \{n_{i,j}(x)\}$ in which the entry in row $i$ and column $j$, $n_{i,j}(x)$, satisfies

$$n_{i,j}(x) = \begin{cases} 1 & \text{if } M \text{ moves from state } q_i \text{ to state } q_j \text{ on input } x \\ 0 & \text{otherwise} \end{cases}$$

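Consider again the FSM shown in Fig. 3.10. The matrices associated with its four pairs of inputs $x \in \{(0, 0), (0, 1), (1, 0), (1, 1)\}$ are shown below, where $N((0, 1)) = N((1, 0))$:

$$N((0,0)) = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix} \qquad N((0,1)) = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \qquad N((1,1)) = \begin{bmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \end{bmatrix}$$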



From these matrices the generic matrix $N((u, v))$ parameterized by the values of the inputs (a pair $(u, v)$ in this example) is produced from the following Boolean functions: $t = \bar{u} \wedge \bar{v}$, the carry-terminate function, $p = u \oplus v$, the carry-propagate function, and $g = u \wedge v$, the carry-generate function.

$$N((u,v)) = \begin{bmatrix} t & p & g & 0 \\ t & p & g & 0 \\ 0 & t & p & g \\ 0 & t & p & g \end{bmatrix}$$



Let $\sigma(i) = (0, 0, \ldots, 0, 1, 0, \ldots, 0)$ be the unit $|Q|$-vector that has value 1 in the $i$th position and zeros elsewhere. Let $\sigma(i)N(x)$ denote Boolean vector-matrix multiplication in which addition is OR and multiplication is AND. Then, for each $i$, $\sigma(i)N(x) = (n_{i,1}, n_{i,2}, \ldots, n_{i,|Q|})$ is the unit vector denoting the state that $M$ enters when it is in state $q_i$ and receives input $x$.

Let $N(x, y) = N(x) \times N(y)$ be the Boolean matrix-matrix multiplication of matrices $N(x)$ and $N(y)$ in which addition is OR and multiplication is AND. Then, for each $x$ and $y$ the entry in row $i$ and column $j$ of $N(x) \times N(y)$, namely $n^{(2)}_{i,j}(x, y)$, satisfies the following identity:

$$n^{(2)}_{i,j}(x, y) = \bigvee_{q_t \in Q} n_{i,t}(x) \cdot n_{t,j}(y)$$

That is, $n^{(2)}_{i,j}(x, y) = 1$ if there is a state $q_t \in Q$ such that in state $q_i$, $M$ is given input $x$, moves to state $q_t$, and then moves to state $q_j$ on input $y$. Thus, the composition operator $\odot$ can be realized through the multiplication of Boolean matrices. It is straightforward to show that matrix multiplication is associative. (See Problem 3.10.)

Since matrix multiplication is associative, a prefix computation using matrix multiplication as a composition operator for each prefix $x^{(j)} = (x_1, x_2, \ldots, x_j)$ of the input string $x$ generates a matrix $N(x^{(j)}) = N(x_1) \times N(x_2) \times \cdots \times N(x_j)$ defining the state-to-state mapping associated with $x^{(j)}$ for each value of $1 \le j \le n$.

The fourth step, the application of a sequence of state-to-state mappings to the initial state $s = q_r$, represented by the $|Q|$-vector $\sigma(r)$, is obtained through the vector-matrix multiplication $\sigma(r)N(x^{(j)})$ for $1 \le j \le n$.

The fifth step involves the computation of the output word from the current state. Let the column $|Q|$-vector $\lambda$ contain in the $t$th position the output of the FSM $M$ when in state $q_t$. Then, the output produced by the FSM after the $j$th input is the product $\sigma(r)N(x^{(j)})\lambda$. This result is summarized below.

THEOREM 3.2.2 Let the finite-state machine $M = (\Sigma, \Psi, Q, \delta, \lambda, s, F)$ with $|Q|$ states compute a subfunction $f$ of $f_M^{(T)}$ in $T$ steps. Then $f$ has the following size and depth bounds over the standard basis $\Omega_0$ for some $\kappa \ge 1$:

$$C_{\Omega_0}(f) = O(M_{\text{matrix}}(|Q|, \kappa)\, T)$$

$$D_{\Omega_0}(f) = O((\kappa \log |Q|)(\log T))$$

Here $M_{\text{matrix}}(n, \kappa)$ is the size of a circuit to multiply two $n \times n$ matrices with a circuit of depth $\kappa \log n$. These bounds can be achieved simultaneously.

Proof The circuits realizing the Boolean functions $\{n_{i,j}(x) \mid 1 \le i, j \le |Q|\}$, $x$ an input, each have a size determined by the size of the input alphabet $\Sigma$, which is constant. The number of operations required to multiply two Boolean matrices with a circuit of depth $\kappa \log |Q|$, $\kappa \ge 1$, is $M_{\text{matrix}}(|Q|, \kappa)$. (See Section 6.3. Note that $M_{\text{matrix}}(|Q|, \kappa) \le |Q|^3$.) Finally, the prefix circuit uses $O(T)$ copies of the matrix multiplication circuit and has a depth of $O(\log T)$ copies of the matrix multiplication circuit along the longest path. (See Section 2.6.) ■
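The matrix formulation of the five steps can be checked on the adder FSM with a short sketch. It is illustrative rather than efficient: the running product below is accumulated sequentially, not by a parallel prefix circuit, and the names `bool_matmul`, `adder_matrix`, `LAMBDA`, and `simulate` are hypothetical. The state numbering $r = 2c + s$ and the output vector (the sum bit of each state) are assumptions taken from the description of the adder FSM.

```python
def bool_matmul(a, b):
    """Boolean matrix product: addition is OR, multiplication is AND."""
    n = len(a)
    return [[int(any(a[i][t] and b[t][j] for t in range(n)))
             for j in range(n)] for i in range(n)]

def adder_matrix(u, v):
    """N((u, v)) for the four-state adder FSM, built from t, p, and g."""
    t = (1 - u) & (1 - v)   # carry-terminate
    p = u ^ v               # carry-propagate
    g = u & v               # carry-generate
    return [[t, p, g, 0], [t, p, g, 0], [0, t, p, g], [0, t, p, g]]

# Column vector of outputs: state q_r with r = 2c + s outputs the sum bit s.
LAMBDA = [0, 1, 0, 1]

def simulate(inputs, r=0):
    """Outputs sigma(r) N(x^(j)) lambda for j = 1, ..., T."""
    outputs, prod = [], None
    for u, v in inputs:
        n = adder_matrix(u, v)
        prod = n if prod is None else bool_matmul(prod, n)
        row = prod[r]  # sigma(r) N(x^(j)) selects row r of the product
        outputs.append(int(any(x and y for x, y in zip(row, LAMBDA))))
    return outputs
```

For example, feeding the pairs for 3 + 1 least significant bit first, `simulate([(1, 1), (1, 0), (0, 0)])` yields the sum bits of 4 in the same order.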

When an FSM has a large number of states but its next-state function is relatively simple, that is, it has a size that is at worst a polynomial in $\log |Q|$, the above size bound will be much larger than the size bound given in Theorem 3.1.1 because $M_{\text{matrix}}(n, \kappa)$ grows exponentially in $\log |Q|$. The depth bound grows linearly with $\log |Q|$ whereas the depth of the next-state function on which the depth bound of Theorem 3.1.1 depends will typically grow either linearly or as a small polynomial in $\log \log |Q|$ for an FSM with a relatively simple next-state function. Thus, the depth bound will be smaller than that of Theorem 3.1.1 for very large values of $T$, but for smaller values, the latter bound will dominate.

3.2.1 A Shallow Circuit Simulating Addition

Applying the above result to the adder FSM of Fig. 3.10, we produce a circuit that accepts $T$ pairs of binary inputs and computes the sum as $T$-bit binary numbers. Since this FSM has four states, the theorem states that the circuit has size $O(T)$ and depth $O(\log T)$. The carry-lookahead adder of Section 2.7 has these characteristics.

We can actually produce the carry-lookahead circuit by a more careful design of the state-to-state mappings. We use the following encodings for states, where states are represented by pairs $\{(c, s)\}$.

State Encoding

q       c   s
$q_0$   0   0
$q_1$   0   1
$q_2$   1   0
$q_3$   1   1

Since the next-state mappings are the same for inputs (0, 1) and (1, 0), we encode an input pair $(u, v)$ by $(g, p)$, where $g = u \wedge v$ and $p = u \oplus v$ are the carry-generate and carry-propagate variables introduced in Section 2.7 and used above. With these encodings, the three different next-state mappings $\{\Delta_{0,0}, \Delta_{0,1}, \Delta_{1,1}\}$ defined in Fig. 3.11 can be encoded as shown in the table below. The entry at the intersection of row $(c, s)$ and column $(p, g)$ in this table is the value $(c^*, s^*)$ of the generic next-state function $(c^*, s^*) = \Delta_{p,g}(c, s)$. (Here we abuse notation slightly to let $\Delta_{p,g}$ denote the state-to-state mapping associated with the pair $(u, v)$ and represent the state $q$ of $M$ by the pair $(c, s)$.)

          g = 0      g = 0      g = 1
          p = 0      p = 1      p = 0
c   s     c*  s*     c*  s*     c*  s*
0   0     0   0      0   1      1   0
0   1     0   0      0   1      1   0
1   0     0   1      1   0      1   1
1   1     0   1      1   0      1   1



Inspection of this table shows that we can write the following formulas for $c^*$ and $s^*$:

$$c^* = (p \wedge c) \vee g, \qquad s^* = p \oplus c$$

Consider two successive input pairs $(u_1, v_1)$ and $(u_2, v_2)$ and associated pairs $(p_1, g_1)$ and $(p_2, g_2)$. If the FSM of Fig. 3.10 is in state $(c_0, s_0)$ and receives input $(u_1, v_1)$, it enters the state $(c_1, s_1) = (p_1 \wedge c_0 \vee g_1, p_1 \oplus c_0)$. This new state can be obtained by combining $p_1$ and $g_1$ with $c_0$. Let $(c_2, s_2)$ be the successor state when the mapping $\Delta_{p_2,g_2}$ is applied to $(c_1, s_1)$. The effect of the operator $\odot$ on successive state-to-state mappings $\Delta_{p_1,g_1}$ and $\Delta_{p_2,g_2}$ is shown below, in which (3.2) is used:

$$\begin{aligned}
(\Delta_{p_1,g_1} \odot \Delta_{p_2,g_2})(q) &= \Delta_{p_2,g_2}(\Delta_{p_1,g_1}((c_0, s_0))) \\
&= \Delta_{p_2,g_2}(p_1 \wedge c_0 \vee g_1,\; p_1 \oplus c_0) \\
&= (p_2 \wedge (p_1 \wedge c_0 \vee g_1) \vee g_2,\; p_2 \oplus (p_1 \wedge c_0 \vee g_1)) \\
&= ((p_2 \wedge p_1) \wedge c_0 \vee (g_2 \vee p_2 \wedge g_1),\; p_2 \oplus (p_1 \wedge c_0 \vee g_1)) \\
&= (c_2, s_2)
\end{aligned}$$

It follows that $c_2$ can be computed from $p^* = p_2 \wedge p_1$ and $g^* = g_2 \vee p_2 \wedge g_1$ and $c_0$. The value of $s_2$ is obtained from $p_2$ and $c_1$. Thus the mapping $\Delta_{p_1,g_1} \odot \Delta_{p_2,g_2}$ is defined by $p^*$ and $g^*$, quantities obtained by combining the pairs $(p_1, g_1)$ and $(p_2, g_2)$ using the same associative operator $\diamond$ defined for the carry-lookahead adder in Section 2.7.1.

To summarize, the state-to-state mappings corresponding to subsequences of an input string $((u_0, v_0), (u_1, v_1), \ldots, (u_{n-2}, v_{n-2}), (u_{n-1}, v_{n-1}))$ can be computed by representing this string by the carry-propagate, carry-generate string $((p_0, g_0), (p_1, g_1), \ldots, (p_{n-2}, g_{n-2}), (p_{n-1}, g_{n-1}))$, computing the prefix operation on this string using the operator $\diamond$, then computing $c_i$ from $c_0$ and the carry-propagate and carry-generate functions for the $i$th stage and $s_i$ from this carry-propagate function and $c_{i-1}$. This leads to the carry-lookahead adder circuit of Section 2.7.1.
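The summary above can be turned into a small sketch of addition through a prefix computation on (propagate, generate) pairs. The function names `diamond` and `add_bits` are hypothetical, bit lists are assumed to be least significant bit first, stages are numbered from 0, and the prefix is again a sequential scan standing in for the parallel prefix circuit.

```python
from itertools import accumulate

def diamond(first, second):
    """Combine (p1, g1), then (p2, g2), into (p*, g*)."""
    p1, g1 = first
    p2, g2 = second
    return (p2 & p1, g2 | (p2 & g1))

def add_bits(u, v, c0=0):
    """Add two equal-length bit lists; return (sum bits, carry out)."""
    pg = [(a ^ b, a & b) for a, b in zip(u, v)]
    prefixes = list(accumulate(pg, diamond))
    # carries[i] is the carry into stage i; the last entry is the carry out.
    carries = [c0] + [g | (p & c0) for p, g in prefixes]
    sums = [p ^ c for (p, _), c in zip(pg, carries)]
    return sums, carries[-1]
```

Each carry depends only on $c_0$ and one prefix pair, which is why a logarithmic-depth prefix circuit gives a logarithmic-depth adder.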



3.3 Designing Sequential Circuits 



Sequential circuits are concrete machines constructed of gates and binary memory devices. 
Given an FSM, a sequential machine can be constructed for it, as we show. 

A sequential circuit is constructed from a logic circuit and a collection of clocked binary 
memory units, as suggested in Figs. 3.12(a) and 3.15. (Shown in Fig. 3.12(a) is a simple 
sequential circuit that computes the EXCLUSIVE OR of the initial value in memory and the 
external input to the sequential circuit.) Inputs to the logic circuit consist of outputs from the 
binary memory units as well as external inputs. The outputs of the logic circuit serve as inputs 
to the clocked binary memory units as well as external outputs. 

A clocked binary memory unit is driven by a clock, a periodic signal that has value 1 (it is high) during short, uniformly spaced time intervals and is otherwise 0 (it is low), as suggested in Fig. 3.12(b). For correct operation it is assumed that the input to a memory unit does not change when the clock is high. Thus, the outputs of a logic circuit feeding the memory units cannot change during these intervals. This in turn requires that all changes in the inputs to this circuit be fully propagated to its outputs in the intervals when the clock is low. A circuit that operates this way is considered safe. Designers of sequential circuits calculate the time for signals to pass through a logic circuit and set the interval between clock pulses to ensure that the operation of the sequential circuit is safe.

Figure 3.12 (a) A sequential circuit with one gate and one clocked memory unit computing the EXCLUSIVE OR of its inputs; (b) a periodic clock pattern.

Sequential circuits are designed from finite-state machines (FSMs) in a series of steps. Consider an FSM $M = (\Sigma, \Psi, Q, \delta, \lambda, s)$ with input alphabet $\Sigma$, output alphabet $\Psi$, state set $Q$, next-state function $\delta : Q \times \Sigma \mapsto Q$, output function $\lambda : Q \mapsto \Psi$, and initial state $s$. (For this discussion we ignore the set of final states; they are important only when discussing language recognition.) We illustrate the design of a sequential machine using the FSM of Fig. 3.10, which is repeated in Fig. 3.13.

The first step in producing a sequential circuit from an FSM is to assign unique binary 
tuples to each input letter, output letter, and state (the state-assignment problem). This is 
illustrated for our FSM by the tables of Fig. 3.14 in which the identity encoding is used on 
inputs and outputs. This step can have a large impact on the size of the logic circuit produced. 
Second, tables for $\delta : \mathcal{B}^4 \mapsto \mathcal{B}^2$ and $\lambda : \mathcal{B}^2 \mapsto \mathcal{B}$, the next-state and output functions of
the FSM, respectively, are produced from the description of the FSM, as shown in the same 
figure. Here c* and s* represent the successor to the state (c, s). Third, circuits are designed 
that realize the binary functions associated with c* and s* . Fourth and finally, these circuits are 
connected to clocked binary memory devices, as shown in Fig. 3.15, to produce a sequential 
circuit that realizes the FSM. We leave to the reader the task of demonstrating that these circuits 
compute the functions defined by the tables. (See Problem 3.11.) 
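The design steps can be illustrated with a sketch that tabulates the next-state function of the adder FSM under the encoding $q_r \leftrightarrow (c, s)$, $r = 2c + s$, and then steps the resulting machine one clock tick at a time. The names `next_state`, `output`, `transition_table`, and `run` are hypothetical, and the sketch models only the logic, not the timing of a clocked circuit.

```python
from itertools import product

def next_state(c, s, u, v):
    """Next (c*, s*) of the adder FSM on input pair (u, v)."""
    total = u + v + c          # a full adder sums the inputs and stored carry
    return total >> 1, total & 1

def output(c, s):
    """The output of the FSM is the sum bit of the current state."""
    return s

def transition_table():
    """Truth table of the next-state function, from 4 bits to 2 bits."""
    return {(c, s, u, v): next_state(c, s, u, v)
            for c, s, u, v in product((0, 1), repeat=4)}

def run(pairs, state=(0, 0)):
    """Step the machine; each iteration models one clock tick."""
    outs = []
    for u, v in pairs:
        state = next_state(*state, u, v)   # memory units load (c*, s*)
        outs.append(output(*state))
    return outs
```

The table produced by `transition_table()` agrees with the formulas $c^* = (p \wedge c) \vee g$ and $s^* = p \oplus c$ on all sixteen rows, which is one way to check a candidate circuit for $c^*$ and $s^*$.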

Since gates and clocked memory devices can be constructed from semiconductor materials, 
a sequential circuit can be assembled from physical components by someone skilled in the use 
of this technology. We design sequential circuits in this book to obtain upper bounds on the 
size and depth of the next-state and output functions of a sequential machine so that we can 
derive computational inequalities. 




Figure 3.13 A finite-state machine that simulates the ripple adder of Fig. 2.14. It is in state $q_r$ if the carry-and-sum pair $(c_{j+1}, s_j)$ generated by the $j$th full adder of the ripple adder represents the integer $r$, $0 \le r \le 3$. The output produced is the sum bit.

Figure 3.14 Encodings for inputs, outputs, states, and the next-state and output functions of the FSM adder.

Figure 3.15 A sequential circuit for the FSM that adds binary numbers.

3.3.1 Binary Memory Devices

It is useful to fix ideas about memory units by designing one (a latch) from logic gates. We use two latches to create a flip-flop, the standard binary storage device. A collection of clocked flip-flops is called a register. A clocked latch can be constructed from a few AND and NOT gates, as shown in Fig. 3.16(a). The NAND gates (they compute NOT of AND) labeled $g_3$ and $g_4$ form the heart of the latch. Consider the inputs to $g_3$ and $g_4$, the lines connected to the outputs of NAND gates $g_1$ and $g_2$. If one is set to 1 and the other reset to 0, after all signals settle down, $p$ and $p^*$ will assume complementary values (one will have value 1 and the other will have value 0), regardless of their previous values. The gate with input 1 will assume output 0 and vice versa.

Now if the outputs of $g_1$ and $g_2$ are both set to 1 and the values previously assumed by $p$ and $p^*$ are complementary, these values will be retained due to the feedback between $g_3$ and $g_4$, as the reader can verify. Since the outputs of $g_1$ and $g_2$ are both 1 when the clock input (CLK in Fig. 3.16) has value 0, the complementary outputs of $g_3$ and $g_4$ remain unchanged when the clock is low. Since the outputs of a latch provide inputs to the logic-circuit portion of a sequential circuit, it is important that the latch outputs remain constant when the clock is low.

When the clock input is 1, the outputs of $g_1$ and $g_2$ are $\bar{S}$ and $\bar{R}$, the Boolean complements of $S$ and $R$. If $S$ and $R$ are complementary, as is true for this latch since $R = \bar{S}$, this device will store the value of $S$ in $p$ and its complement in $p^*$. Thus, if $S = 1$, the latch is set to 1, whereas if $R = 1$ (and $S = 0$) it is reset to 0. This type of device is called a D-type latch. For this reason we change the name of the external input to this memory device from $S$ to $D$.

Because the output of the D-type latch shown in Fig. 3.16(a) changes when the clock pulse 
is high, it cannot be used as a stable input to a logic circuit that feeds this or another such flip- 
flop. Adding another stage like the first but having the complementary value for the clock 
pulse, as shown in Fig. 3.16(b), causes the output of the second stage to change only while the 
clock pulse is low. The output of the first stage does change when the clock pulse is high to 
record the new value of the state. This is called a master-slave edge-triggered flip-flop. Other 
types of flip-flop are described in texts on computer architecture. 
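The behavior of the latch and the two-stage flip-flop can be explored with a small gate-level sketch. The wiring is an assumption inferred from the description, since the figure itself is not reproduced here: $g_1$ = NAND($D$, CLK), $g_2$ = NAND($\bar{D}$, CLK), $g_3$ = NAND($g_1$, $p^*$) producing $p$, and $g_4$ = NAND($g_2$, $p$) producing $p^*$. The class names are hypothetical, and signal settling is modeled by iterating the cross-coupled pair until it stops changing.

```python
def nand(a, b):
    return 1 - (a & b)

class DLatch:
    """Clocked D-type latch built from four NAND gates (assumed wiring)."""

    def __init__(self, p=0):
        self.p, self.p_star = p, 1 - p

    def settle(self, d, clk):
        g1 = nand(d, clk)
        g2 = nand(1 - d, clk)
        for _ in range(4):               # let the feedback pair stabilize
            p = nand(g1, self.p_star)    # gate g3
            p_star = nand(g2, p)         # gate g4
            if (p, p_star) == (self.p, self.p_star):
                break
            self.p, self.p_star = p, p_star
        return self.p

class MasterSlaveFlipFlop:
    """Two latches in series; the second sees the complemented clock."""

    def __init__(self):
        self.master, self.slave = DLatch(), DLatch()

    def apply(self, d, clk):
        m = self.master.settle(d, clk)
        return self.slave.settle(m, 1 - clk)
```

With this model the first stage follows $D$ while the clock is high and the second stage copies it only after the clock goes low, so the flip-flop output is stable whenever the clock is high.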



Figure 3.16 (a) Design of a D-type latch from NAND gates. (b) A master-slave edge-triggered D-type flip-flop.

3.4 Random-Access Machines

The random-access machine (RAM) models the essential features of the traditional serial 
computer. The RAM is modeled by two synchronous interconnected FSMs, a central process- 
ing unit (CPU) and a random-access memory. (See Fig. 3.17.) The CPU has a small number 
of storage locations called registers whereas the random-access memory has a large number. 
All operations performed by the CPU are performed on data stored in its registers. This is done 
for efficiency; no increase in functionality is obtained by allowing operations on data stored in 
memory locations as well. 

3.4.1 The RAM Architecture

The CPU implements a fetch-and-execute cycle in which it alternately reads an instruction 
from a program stored in the random-access memory (the stored-program concept) and ex- 
ecutes it. Instructions are read and executed from consecutive locations in the random-access 
memory unless a jump instruction is executed, in which case an instruction from a non- 
consecutive location is executed next. 

A CPU typically has five basic kinds of instruction: a) arithmetic and logical instructions of 
the kind described in Sections 2.5.1, 2.7, 2.9, and 2.10, b) memory load and store instructions 
for moving data between memory locations and registers, c) jump instructions for breaking 
out of the current program sequence, d) input and output (I/O) instructions, and e) a halt 
instruction. 

The basic random-access memory has an output word (out_wrd) and three input words, an address (addr), a data word (in_wrd), and a command (cmd). The command specifies one of three actions, a) read from a memory location, b) write to a memory location, or c) do nothing. Reading from address addr deposits the value of the word at this location into out_wrd whereas writing to addr replaces the word at this address with the value of in_wrd.

Figure 3.17 The random-access machine has a central processing unit (CPU) and a random-access memory unit.

This memory is called random-access because the time to access a word is the same for all words. The Turing machine introduced in Section 3.7 has a tape memory in which the time to access a word increases with its distance from the tape head.

The random-access memory in the model in Fig. 3.17 has $m = 2^\mu$ storage locations each containing a $b$-bit word, where $\mu$ and $b$ are integers. Each word has a $\mu$-bit address and the addresses are consecutive starting at zero. The combination of this memory and the CPU described above is the bounded-memory RAM. When no limit is placed on the number and size of memory words, this combination defines the unbounded-memory RAM. We use the term RAM for these two machines when context unambiguously determines which is intended.

DESIGN OF A SIMPLE CPU The design of a simple CPU is given in Section 3.10. (See 
Fig. 3.31.) This CPU has eight registers, a program counter (PC), accumulator (AC), mem- 
ory address register (MAR), memory data register (MDR), operation code (opcode) regis- 
ter (OPC), input register (INR), output register (OUTR), and halt register (HALT). Each 
operation that requires two operands, such as addition or vector AND, uses AC and MDR as 
sources for the operands and places the result in AC. Each operation with one operand, such 
as the NOT of a vector, uses AC as both source and destination for the result. PC contains the 
address of the next instruction to be executed. Unless a jump instruction is executed, PC is 
incremented on the execution of each instruction. If a jump instruction is executed, the value 
of PC is changed. Jumps occur in our simple CPU if AC is zero. 

To fetch the next instruction, the CPU copies PC to MAR and then commands the 
random-access memory to read the word at the address in MAR. This word appears in MDR. 
The portion of this word containing the identity of the opcode is transferred to OPC. The 
CPU then inspects the value of OPC and performs the small local operations to execute the 
instruction represented by it. For example, to perform an addition it commands the arith- 
metic/logical unit (ALU) to combine the contents of MDR and AC in an adder circuit and 
deposit the result in AC. If the instruction is a load accumulator instruction (LDA), the CPU 
treats the bits other than opcode bits as address bits and moves them to the MAR. It then com- 
mands the random-access memory to deposit the word at this address in MDR, after which it 
moves the contents of MDR to AC. In Section 3.4.3 we illustrate programming in an assembly 
language, the language of a machine enhanced by mnemonics and labels. We further illustrate 
assembly-language programming in Section 3.10.4 for the instruction set of the machine de- 
signed in Section 3.10. 

3.4.2 The Bounded-Memory RAM as FSM 

As this discussion illustrates, the CPU and the random-access memory are both finite-state 
machines. The CPU receives input from the random-access memory as well as from external 
sources. Its output is to the memory and the output port. Its state is determined by the 
contents of its registers. The random-access memory receives input from and produces output 
to the CPU. Its state is represented by an $m$-tuple $(w_0, w_1, \ldots, w_{m-1})$ of $b$-bit words, one
per memory location, as well as by the values of in_wrd, out_wrd, and addr. We say that
the random-access memory has a storage capacity of S = mb bits. The RAM has input and 
output registers (not shown in Fig. 3.17) through which it reads external inputs and produces 
external outputs. 

As the RAM example illustrates, some FSMs are programmable. In fact, a program stored in the RAM memory selects one of very many state sequences that the RAM may execute. The number of states of a RAM can be very large; just the random-access memory alone has more than $2^S$ states.

The programmability of the unbounded-memory RAM makes it universal for FSMs, as 
we show in Section 3.4.4. Before taking up this subject, we pause to introduce an assembly- 
language program for the unbounded-memory RAM. This model will play a role in Chapter 5. 

3.4.3 Unbounded-Memory RAM Programs 

We now introduce assembly-language programs to make concrete the use of the RAM. An 
assembly language contains one instruction for each machine-level instruction of a CPU. How- 
ever, instead of bit patterns, it uses mnemonics for opcodes and labels as symbolic addresses. 
Labels are used in jump instructions. 

Figure 3.18 shows a simple assembly language. It implements all the instructions of the 
CPU defined in Section 3.10 and vice versa if the CPU has a sufficiently long word length. 

Our new assembly language treats all memory locations as equivalent and calls them reg- 
isters. Thus, no distinction is made between the memory locations in the CPU and those 
in the random-access memory. Such a distinction is made on real machines for efficiency: it 
is much quicker to access registers internal to a CPU than memory locations in an external 
random-access memory. 

Registers are used for data storage and contain integers. Register names are drawn from the set $\{R_0, R_1, R_2, \ldots\}$. The address of register $R_i$ is $i$. Thus, both the number of registers and their size are potentially unlimited. All registers are initialized with the value zero. Registers used as input registers to a program are initialized to input values. Results of a computation are placed in output registers. Such registers may also serve as input registers. Each instruction may be given a label drawn from the set $\{N_0, N_1, N_2, \ldots\}$. Labels are used by jump instructions, as explained below.



Instruction           Meaning
INC $R_i$             Increment the contents of $R_i$ by 1.
DEC $R_i$             Decrement the contents of $R_i$ by 1.
CLR $R_i$             Replace the contents of $R_i$ with 0.
$R_i \leftarrow R_j$  Replace the contents of $R_i$ with those of $R_j$.
JMP+ $N_i$            Jump to closest instruction above current one with label $N_i$.
JMP- $N_i$            Jump to closest instruction below current one with label $N_i$.
$R_j$ JMP+ $N_i$      If $R_j$ contains 0, jump to closest instruction above current one with label $N_i$.
$R_j$ JMP- $N_i$      If $R_j$ contains 0, jump to closest instruction below current one with label $N_i$.
CONTINUE              Continue to next instruction; halt if none.

Figure 3.18 The instructions in a simple assembly language.
The meaning of each instruction should be clear except possibly for the CONTINUE and JUMP. If the program reaches a CONTINUE statement other than the last CONTINUE, it executes the following instruction. If it reaches the last CONTINUE statement, the program halts.

The jump instructions $R_j$ JMP+ $N_i$, $R_j$ JMP- $N_i$, JMP+ $N_i$, and JMP- $N_i$ cause a break in the program sequence. Instead of executing the next instruction in sequence, they cause jumps to instructions with labels $N_i$. In the first two cases these jumps occur only when the content of register $R_j$ is zero. In the last two cases, these jumps occur unconditionally. The instructions with JMP+ (JMP-) cause a jump to the closest instruction with label $N_i$ above (below) the current instruction. The use of the suffixes + and - permits the insertion of program fragments into an existing program without relabeling instructions.

A RAM program is a finite sequence of assembly language instructions terminated with 
CONTINUE. A valid program is one for which each jump is to an existing label. A halting 
program is one that halts. 

TWO RAM PROGRAMS We illustrate this assembly language with the two simple programs shown in Fig. 3.19. The first adds two numbers and the second uses the first to square a number. The heading of each program explains its operation. Registers $R_0$ and $R_1$ contain the initial values on which the addition program operates. On each step it increments $R_0$ by 1 and decrements $R_1$ by 1 until $R_1$ is 0. Thus, on completion, the value of $R_0$ is its original value plus the original value of $R_1$, and $R_1$ contains 0.

The squaring program uses the addition program. It makes three copies of the initial value $x$ of $R_0$ and stores them in $R_1$, $R_2$, and $R_3$. It also clears $R_0$. $R_2$ will be used to reset $R_1$ to $x$ after adding $R_1$ to $R_0$. $R_3$ is used as a counter and decremented $x$ times, after which $x$ is added to zero $x$ times in $R_0$; that is, $x^2$ is computed.



$R_0 \leftarrow R_0 + R_1$

Label   Instruction            Comments
$N_0$   $R_1$ JMP- $N_1$       End if $R_1 = 0$
        INC $R_0$              Increment $R_0$
        DEC $R_1$              Decrement $R_1$
        JMP+ $N_0$             Repeat
$N_1$   CONTINUE

$R_0 \leftarrow R_0^2$

Label   Instruction            Comments
        $R_2 \leftarrow R_0$   Copy $R_0$ ($x$) to $R_2$
        $R_3 \leftarrow R_0$   Copy $R_0$ ($x$) to $R_3$
        CLR $R_0$              Clear the contents of $R_0$
$N_2$   $R_1 \leftarrow R_2$   Copy $R_2$ ($x$) to $R_1$
$N_0$   $R_1$ JMP- $N_1$       $R_0 \leftarrow R_0 + R_1$ (these five instructions)
        INC $R_0$
        DEC $R_1$
        JMP+ $N_0$
$N_1$   CONTINUE
        DEC $R_3$              Decrement $R_3$
        $R_3$ JMP- $N_3$       End when zero
        JMP+ $N_2$             Add $x$ to $R_0$
$N_3$   CONTINUE

Figure 3.19 Two simple RAM programs. The first adds two integers stored initially in registers $R_0$ and $R_1$, leaving the result in $R_0$. The second uses the first to square the contents of $R_0$, leaving the result in $R_0$.
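A short interpreter makes the semantics of this assembly language concrete. It is a sketch under stated assumptions, not part of the text: the tuple encoding of instructions, the opcode names `MOV`, `JZ+`, and `JZ-` (standing for $R_i \leftarrow R_j$ and the two conditional jumps), the function `run`, and the step limit are all hypothetical. DEC is taken literally as integer subtraction, so the squaring program is exercised only for $x \ge 1$.

```python
def run(program, registers, max_steps=100_000):
    """Interpret a RAM program given as a list of (label, op, *args)."""
    regs = dict(registers)
    pc = 0

    def find(label, start, step):
        """Closest instruction with this label above (-1) or below (+1)."""
        i = start + step
        while 0 <= i < len(program):
            if program[i][0] == label:
                return i
            i += step
        raise ValueError(f"no label {label}")

    for _ in range(max_steps):
        if pc >= len(program):           # ran past the last CONTINUE: halt
            return regs
        _, op, *args = program[pc]
        nxt = pc + 1
        if op == "INC":
            regs[args[0]] = regs.get(args[0], 0) + 1
        elif op == "DEC":
            regs[args[0]] = regs.get(args[0], 0) - 1
        elif op == "CLR":
            regs[args[0]] = 0
        elif op == "MOV":                # R_i <- R_j
            regs[args[0]] = regs.get(args[1], 0)
        elif op in ("JMP+", "JMP-"):
            nxt = find(args[0], pc, -1 if op == "JMP+" else 1)
        elif op in ("JZ+", "JZ-"):       # R_j JMP+ N_i and R_j JMP- N_i
            if regs.get(args[0], 0) == 0:
                nxt = find(args[1], pc, -1 if op == "JZ+" else 1)
        elif op != "CONTINUE":
            raise ValueError(f"unknown instruction {op}")
        pc = nxt
    raise RuntimeError("step limit exceeded")

ADD = [
    ("N0", "JZ-", 1, "N1"),
    (None, "INC", 0),
    (None, "DEC", 1),
    (None, "JMP+", "N0"),
    ("N1", "CONTINUE"),
]

SQUARE = [
    (None, "MOV", 2, 0),
    (None, "MOV", 3, 0),
    (None, "CLR", 0),
    ("N2", "MOV", 1, 2),
    *ADD,                                # inserted without relabeling
    (None, "DEC", 3),
    (None, "JZ-", 3, "N3"),
    (None, "JMP+", "N2"),
    ("N3", "CONTINUE"),
]
```

Splicing `ADD` into `SQUARE` unchanged works because each jump resolves to the closest matching label in the stated direction, which is the purpose of the + and - suffixes.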
As indicated above, with large enough words each of the above assembly-language instructions can be realized with a few instructions from the instruction set of the CPU designed in Section 3.10. It is also true that each of these CPU instructions can be implemented by a fixed number of instructions in the above assembly language. That is, with sufficiently long memory words in the CPU and random-access memory, the two languages allow the same computations with about the same use of time and space.

However, the above assembly-language instructions are richer than is absolutely essential to perform all computations. In fact with just five assembly-language instructions, namely INC, DEC, CONTINUE, $R_j$ JMP+ $N_i$, and $R_j$ JMP- $N_i$, all the other instructions can be realized. (See Problem 3.21.)

3.4.4 Universality of the Unbounded-Memory RAM 

The unbounded-memory RAM is universal in two senses. First, it can simulate any finite- 
state machine including another random-access machine, and second, it can execute any RAM 
program. 

DEFINITION 3.4.1 A machine $M$ is universal for a class of machines $\mathcal{C}$ if every machine in $\mathcal{C}$ can be simulated by $M$. (A stronger definition requiring that $M$ also be in $\mathcal{C}$ is used in Section 3.8.)

We now show that the RAM is universal for the class $\mathcal{C}$ of finite-state machines. We show that in $O(T)$ steps and with constant storage capacity $S$ the RAM can simulate $T$ steps of any other FSM. Since any random-access machine that uses a bounded amount of memory can be described by a logic circuit such as the one defined in Section 3.10, it can also be simulated by the RAM.

THEOREM 3.4.1 Every $T$-step FSM $M = (\Sigma, \Psi, Q, \delta, \lambda, s, F)$ computation can be simulated by a RAM in $O(T)$ steps with constant space. Thus, the RAM is universal for finite-state machines.

Proof We sketch a proof. Since an FSM is characterized completely by its next-state and output functions, both of which are assumed to be encoded by binary functions, it suffices to write a fixed-length RAM program to perform a state transition, generate output, and record the FSM state in the RAM memory using the tabular descriptions of the next-state and output functions. This program is then run repeatedly. The amount of memory necessary for this simulation is finite and consists of the memory to store the program plus one state (requiring at least $\log_2 |Q|$ bits). While the amount of storage and time to record and compute these functions is constant, they can be exponential in $\log_2 |Q|$ because the next-state and output functions can be complex binary functions. (See Section 2.12.) Thus, the number of steps taken by the RAM per FSM state transition is constant. ■

The second notion of universality is captured by the idea that the RAM can execute RAM 
programs. We discuss two execution models for RAM programs. In the first, a RAM program 
is stored in a private memory of the RAM, say in the CPU. The RAM alternates between 
reading instructions from its private memory and executing them. In this case the registers 
described in Section 3.4.3 are locations in the random-access memory. The program counter 
either advances to the next instruction in its private memory or jumps to a new location as a 
result of a jump instruction. 

In the second model (called by some [10] the random-access stored program machine (RASP)), a RAM program is stored in the random-access memory itself. A RAM program can be translated to a RASP program by replacing the names of RAM registers by the names of random-access memory locations not used for storing the RAM program. The execution of a RASP program directly parallels that of the RAM program; that is, the RASP alternates between reading instructions and executing them. Since we do not consider the distinction between RASP and RAM significant, we call them both the RAM.



3.5 Random-Access Memory Design 



In this section we model the random-access memory described in Section 3.4 as an FSM $M_{\text{RMEM}}(\mu, b)$ that has $m = 2^\mu$ $b$-bit data words, $w_0, w_1, \ldots, w_{m-1}$, as well as an input data word $d$ (in_wrd), an input address $a$ (addr), and an output data word $z$ (out_wrd). (See Fig. 3.20.) The state of this FSM is the concatenation of the contents of the data, input and output words, input address, and the command word. We construct an efficient logic circuit for its next-state and output functions.

To simplify the design of the FSM $M_{\text{RMEM}}$ we use the following encodings of the three input commands:

Name     $s_1$   $s_0$
no-op    0       0
read     0       1
write    1       0


An input to $M_{\text{RMEM}}$ is a binary $(\mu + b + 2)$-tuple, two bits to represent a command, $\mu$ bits to specify an address, and $b$ bits to specify a data word. The output function of $M_{\text{RMEM}}$, $\lambda_{\text{RMEM}}$, is a simple projection operator and is realized by a circuit without any gates. Applied to the state vector, it produces the output word.

We now describe a circuit for $\delta_{\text{RMEM}}$, the next-state function of $M_{\text{RMEM}}$. Memory words remain unchanged if either no-op or read commands are executed. In these cases the value of the command bit $s_1$ is 0. One memory word changes if $s_1 = 1$, namely, the one whose address is $a$. Thus, the memory words $w_0, w_1, \ldots, w_{m-1}$ change only when $s_1 = 1$. The word that changes is determined by the $\mu$-bit address $a$ supplied as part of the input. Let $a_{\mu-1}, \ldots, a_1, a_0$ be the $\mu$ bits of $a$. Let these bits be supplied as inputs to a $\mu$-bit decoder function $f^{(\mu)}_{\text{decode}}$ (see Section 2.5.4). Let $y_{m-1}, \ldots, y_1, y_0$ be the $m$ outputs of a decoder circuit. Then, the Boolean function $c_i = s_1 y_i$ (shown in Fig. 3.21(a)) is 1 exactly when the input address $a$ is the binary representation of the integer $i$ and the FSM $M_{\text{RMEM}}$ is commanded to write the word $d$ at address $a$.

Figure 3.20 A random-access memory unit $M_{\text{RMEM}}$ that holds $m$ $b$-bit words. Its inputs consist of a command (cmd), an input word (in_wrd), and an address (addr). It has one output word (out_wrd).

Let $w_0^*, w_1^*, \ldots, w_{m-1}^*$ be the new values for the memory words. Let $w_{i,j}^*$ and $w_{i,j}$ be the
$j$th components of $w_i^*$ and $w_i$, respectively. Then, for $0 \le i \le m-1$ and $0 \le j \le b-1$ we
write $w_{i,j}^*$ in terms of $w_{i,j}$ and the $j$th component $d_j$ of $d$ as follows:

    $c_i = s_1 y_i$
    $w_{i,j}^* = \bar{c}_i\, w_{i,j} \lor c_i\, d_j$



Figures 3.21(a) and (b) show circuits described by these formulas. It follows that changes
to memory words can be realized by a circuit containing $C_{\Omega}(f^{(\mu)}_{decode})$ gates for the decoder,
$m$ gates to compute all the terms $c_i$, $0 \le i \le m-1$, and $4mb$ gates to compute $w_{i,j}^*$, $0 \le i \le
m-1$, $0 \le j \le b-1$ (NOTs are counted). Combining this with Lemma 2.5.4, we have that




Figure 3.21 A circuit that realizes the next-state and output function of the random-access
memory. The circuit in (a) computes the next values $\{w_{i,j}^*\}$ for components of memory words,
whereas that in (b) computes components $\{z_j^*\}$ of the output word. The output $y_i \land w_{i,j}$ of (a)
is an input to (b).




a circuit realizing this portion of the next-state function has at most $m(4b + 2) + (2\mu - 2)\sqrt{m}$
gates. The depth of this portion of the circuit is the depth of the decoder plus 4 because the
longest path between an input and an output $w_0^*, w_1^*, \ldots, w_{m-1}^*$ is through the decoder and
then through the gates that form $\bar{c}_i w_{i,j}$. This depth is at most $\lceil \log_2 \mu \rceil + 5$.

The circuit description is complete after we give a circuit to compute the output word $z$.
The value of $z$ changes only when $s_0 = 1$, that is, when a read command is issued. The $j$th
component of $z$, namely $z_j$, is replaced by the value of $w_{i,j}$, where $i$ is the address specified by
the input $a$. Thus, the new value of $z_j$, $z_j^*$, can be represented by the following formula (see
the circuit of Fig. 3.21(b)):

    $z_j^* = \bar{s}_0\, z_j \lor s_0 \left( \bigvee_{k=0}^{m-1} y_k w_{k,j} \right)$ for $0 \le j \le b-1$



Here $\bigvee$ denotes the OR of the $m$ terms $y_k w_{k,j}$, $m = 2^{\mu}$. It follows that for each value of
$j$ this portion of the circuit can be realized with $m$ two-input AND gates and $m - 1$ two-input
OR gates (to form $\bigvee$) plus four additional operations. Thus, it is realized by an additional
$(2m + 3)b$ gates. The depth of this circuit is the depth of the decoder ($\lceil \log_2 \mu \rceil + 1$) plus
$\mu = \log_2 m$, the depth of a tree of $m$ inputs to form $\bigvee$, plus three more levels. Thus, the
depth of the circuit to produce the output word is $\mu + \lceil \log_2 \mu \rceil + 4$.
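The formulas for $c_i$, $w_{i,j}^*$, and $z_j^*$ can be checked with a small bit-level simulation. The following is a minimal sketch, not a gate-for-gate circuit: the function names `decoder` and `rmem_step` are hypothetical, words are represented as Python lists of bits, and the decoder is modeled behaviorally rather than by the circuit of Section 2.5.4.

```python
def decoder(addr_bits):
    """Return [y_0, ..., y_{m-1}] with y_i = 1 exactly when addr_bits encodes i.

    addr_bits is (a_{mu-1}, ..., a_0), most significant bit first.
    This is a behavioral stand-in for the decoder circuit.
    """
    index = 0
    for bit in addr_bits:
        index = (index << 1) | bit
    return [1 if i == index else 0 for i in range(1 << len(addr_bits))]


def rmem_step(words, z, s1, s0, addr_bits, d):
    """One step of the memory FSM.

    words: list of m words, each a list of b bits
    z: current output word (b bits)
    (s1, s0): command bits (no-op = 00, read = 01, write = 10)
    d: input data word (b bits)
    Returns (new_words, new_z).
    """
    y = decoder(addr_bits)
    m, b = len(words), len(z)
    c = [s1 & y[i] for i in range(m)]  # c_i = s_1 y_i
    # w*_{i,j} = (NOT c_i) w_{i,j} OR c_i d_j
    new_words = [
        [((1 - c[i]) & words[i][j]) | (c[i] & d[j]) for j in range(b)]
        for i in range(m)
    ]
    new_z = []
    for j in range(b):
        selected = 0
        for k in range(m):  # OR of the m terms y_k w_{k,j}
            selected |= y[k] & words[k][j]
        # z*_j = (NOT s_0) z_j OR s_0 (selected bit)
        new_z.append(((1 - s0) & z[j]) | (s0 & selected))
    return new_words, new_z
```

For example, issuing a write of `[1, 0]` to address 2 of a four-word memory changes only word 2, and a subsequent read of address 2 copies that word to the output word.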

The size of the complete circuit for the next-state function is at most $m(6b + 2) + (2\mu -
2)\sqrt{m} + 3b$. Its depth is at most $\mu + \lceil \log_2 \mu \rceil + 4$. We state these results as a lemma.

LEMMA 3.5.1 The next-state and output functions of the FSM $M_{RMEM}(\mu, b)$, $\delta_{RMEM}$ and
$\lambda_{RMEM}$, can be realized with the following size and depth bounds over the standard basis $\Omega_0$,
where $S = mb$ is its storage capacity in bits:

    $C_{\Omega_0}(\delta_{RMEM}, \lambda_{RMEM}) \le m(6b + 2) + (2\mu - 2)\sqrt{m} + 3b = O(S)$
    $D_{\Omega_0}(\delta_{RMEM}, \lambda_{RMEM}) \le \mu + \lceil \log_2 \mu \rceil + 4 = O(\log(S/b))$

Random-access memories can be very large, so large that their equivalent number of logic 
elements (which we see from the above lemma is proportional to the storage capacity of the 
memory) is much larger than the tens to hundreds of thousands of logic elements in the CPUs 
to which they are attached. 
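To get a feel for how the bounds of Lemma 3.5.1 scale, they can be evaluated numerically. The sketch below simply evaluates the two expressions of the lemma; the function name `rmem_bounds` is hypothetical, and the returned size is an upper bound on gate count, not an exact count.

```python
import math


def rmem_bounds(mu, b):
    """Evaluate the size and depth upper bounds of Lemma 3.5.1.

    mu: number of address bits (mu >= 1), so the memory has m = 2**mu words
    b: word length in bits
    """
    m = 2 ** mu
    size = m * (6 * b + 2) + (2 * mu - 2) * math.sqrt(m) + 3 * b
    depth = mu + math.ceil(math.log2(mu)) + 4
    return size, depth
```

For a memory with $\mu = 4$ and $b = 8$ (so $S = 128$ bits), `rmem_bounds(4, 8)` evaluates to a size bound of 848 gates and a depth bound of 10, which illustrates that the size bound is a small constant multiple of $S$.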



3.6 Computational Inequalities for the RAM 

We now state computational inequalities that apply for all computations on the bounded- 
memory RAM. Since this machine consists of two interconnected synchronous FSMs, we 
invoke the inequalities of Theorem 3.1.3, which require bounds on the size and depth of the 
next-state and output functions for the CPU and the random-access memory. 

From Section 3.10.6 we see that size and depth of these functions for the CPU grow slowly 
in the word length b and number of memory words m. In Section 3.5 we designed an FSM 
modeling an S-bit random-access memory and showed that the size and depth of its next-state
and output functions are proportional to S and log S, respectively. Combining these results, 
we obtain the following computational inequalities. 
THEOREM 3.6.1 Let $f$ be a subfunction of $f^{(T,m,b)}_{RAM}$, the function computed by the $m$-word, $b$-bit
RAM with storage capacity $S = mb$ in $T$ steps. Then the following bounds hold simultaneously
over the standard basis $\Omega_0$ for logic circuits:

    $C_{\Omega_0}(f) = O(ST)$
    $D_{\Omega_0}(f) = O(T \log S)$

The discussion in Section 3.1.2 of computational inequalities for FSMs applies to this the- 
orem. In addition, this theorem demonstrates the importance of the space-time product, ST, 
as well as the product T log S. While intuition may suggest that ST is a good measure of the 
resources needed to solve a problem on the RAM, this theorem shows that it is a fundamental 
quantity because it directly relates to another fundamental complexity measure, namely, the 
size of the smallest circuit for a function $f$. Similar statements apply to the second inequality.

It is important to ask how tight the inequalities given above are. Since they are both derived 
from the inequalities of Theorem 3.1.1, this question can be translated into a question about 
the tightness of the inequalities of this theorem. The technique given in Section 3.2 can be 
used to tighten the second inequality of Theorem 3.1.1 so that the bounds on circuit depth 
can be improved to logarithmic in T without sacrificing the linearity of the bound on circuit 
size. However, the coefficients on these bounds depend on the number of states and can be 
very large. 



3.7 Turing Machines 



The Turing machine model is the classical model introduced by Alan Turing in his famous 
1936 paper [338]. No other model of computation has been found that can compute func- 
tions that a Turing machine cannot compute. The Turing machine is a canonical model of 
computation used by theoreticians to understand the limits on serial computation, a topic 
that is explored in Chapter 5. The Turing machine also serves as the primary vehicle for the 
classification of problems by their use of space and time. (See Chapter 8.) 

The (deterministic) one-tape, bounded-memory Turing machine (TM) consists of two 
interconnected FSMs, a control unit and a tape unit of potentially unlimited storage capacity. 



Figure 3.22 A bounded-memory one-tape Turing machine.
(It is shown schematically in Fig. 3.22.) At each unit of time the control unit accepts input
from the tape unit and supplies output to it. The tape unit produces the value in the cell
under the head, a $b$-bit word, and accepts and writes a $b$-bit word to that cell. It also accepts
commands to move the head one cell to the left or right or not at all. The bounded-memory
tape unit is an array of $m$ $b$-bit cells and has a storage capacity of $S = mb$ bits. A formal
definition of the one-tape deterministic Turing machine is given below.

DEFINITION 3.7.1 A standard Turing machine (TM) is a six-tuple $M = (\Gamma, \beta, Q, \delta, s, h)$,
where $\Gamma$ is the tape alphabet not containing the blank symbol $\beta$, $Q$ is the finite set of states,
$\delta : Q \times (\Gamma \cup \{\beta\}) \mapsto (Q \cup \{h\}) \times (\Gamma \cup \{\beta\}) \times \{L, N, R\}$ is the next-state function, $s$ is
the initial state, and $h \notin Q$ is the accepting halt state. A TM cannot exit from $h$. If $M$ is in
state $q$ with letter $a$ under the tape head and $\delta(q, a) = (q', a', C)$, its control unit enters state $q'$
and writes $a'$ in the cell under the head, and moves the head left (if possible), right, or not at all if
$C$ is L, R, or N, respectively.

The TM $M$ accepts the input string $w \in \Gamma^*$ (it contains no blanks) if, when started in
state $s$ with $w$ placed left-adjusted on its otherwise blank tape and the tape head at the leftmost
tape cell, the last state entered by $M$ is $h$. If $M$ has other halting states (states from which it does
not exit) these are rejecting states. Also, $M$ may not halt on some inputs.

$M$ accepts the language $L(M)$ consisting of all strings accepted by $M$. If a Turing machine
halts on all inputs, we say that it recognizes the language that it accepts. For simplicity, we
assume that when $M$ halts during language acceptance it writes the letter 1 in its first tape cell if its
input string is accepted and 0 otherwise.

The function computed by a Turing machine on input string w is the string z written 
leftmost on the non-blank portion of the tape after halting. The function computed by a TM is 
partial if the TM fails to halt on some input strings and complete otherwise. 

Thus, a TM performs a computation on input string w, which is placed left-adjusted on 
its tape by placing its head over the leftmost symbol of w and repeatedly reading the symbol 
under the tape head, making a state change in its control unit, and producing a new symbol 
for the tape cell and moving the head left or right by one cell or not at all. The head does not 
move left from the leftmost tape cell. If a TM is used for language acceptance, it accepts w by 
halting in the accepting state h. If the TM is used for computation, the result of a computation 
on input w is the string z that remains on the non-blank portion of its tape. 
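The step-by-step behavior described above can be made concrete with a short simulator. This is a minimal sketch under stated assumptions: the names `run_tm`, `BLANK`, `HALT`, and `FLIP` are hypothetical, the next-state function is a Python dictionary, a missing dictionary entry is treated as a rejecting halt, and the tape grows to the right on demand rather than being bounded to $m$ cells.

```python
BLANK = "β"  # the blank symbol
HALT = "h"   # the accepting halt state


def run_tm(delta, start, w, max_steps=10_000):
    """Simulate a one-tape TM on input string w.

    delta maps (state, symbol) to (new_state, new_symbol, move),
    where move is "L", "R", or "N".
    Returns (accepted, tape contents with trailing blanks removed).
    """
    tape = list(w) if w else [BLANK]
    state, head, steps = start, 0, 0
    # Assumption: a (state, symbol) pair with no entry is a rejecting halt.
    while state != HALT and (state, tape[head]) in delta:
        if steps == max_steps:
            raise RuntimeError("step budget exhausted; the TM may not halt")
        state, tape[head], move = delta[(state, tape[head])]
        if move == "L":
            head = max(head - 1, 0)  # the head does not move left of cell 0
        elif move == "R":
            head += 1
            if head == len(tape):
                tape.append(BLANK)   # extend the tape with a blank cell
        steps += 1
    return state == HALT, "".join(tape).rstrip(BLANK)


# A hypothetical example machine: complement each bit, then halt at the first blank.
FLIP = {
    ("s", "0"): ("s", "1", "R"),
    ("s", "1"): ("s", "0", "R"),
    ("s", BLANK): (HALT, BLANK, "N"),
}
```

Calling `run_tm(FLIP, "s", "0110")` accepts and leaves `1001` on the tape, which is the value of the function this machine computes on that input.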

We require that $M$ store the letter 1 or 0 in its first tape cell when halting during language
acceptance to simplify the construction of a circuit simulating $M$ in Section 3.9.1. This
requirement is not essential because the fact that $M$ has halted in state $h$ can be detected with a
simple circuit.

The multi-tape Turing machine is a generalization of this model that has multiple tape 
units. (These models and limits on their ability to solve problems are examined in Chapter 5, 
where it is shown that the multi-tape TM is no more powerful than the one-tape TM.) Al- 
though in practice a TM uses a bounded number of memory locations, the full power of TMs 
is realized only when they have access to an unbounded number of tape cells. 

Although the TM is much more limited than the RAM in the flexibility with which it can 
access memory, given sufficient time and storage capacity they both compute exactly the same 
set of functions, as we show in Section 3.8. 

A very important class of languages recognized by TMs is the class P of polynomial-time 
languages. 
DEFINITION 3.7.2 A language $L \subseteq \Gamma^*$ is in P if there is a Turing machine $M$ with tape alphabet
$\Gamma$ and a polynomial $p(n)$ such that, for every $w \in \Gamma^*$, a) $M$ halts in $p(|w|)$ steps and b) $M$
accepts $w$ if and only if it is in $L$.

The class P is said to contain all the "feasible" languages because any language requiring 
more than a polynomial number of steps for its recognition is thought to require so much time 
for long strings as not to be recognizable in practice. 

A second important class of languages is NP, the languages accepted in polynomial time 
by nondeterministic Turing machines. To define this class we introduce the nondeterministic 
Turing machines. 

3.7.1 Nondeterministic Turing Machines 

A nondeterministic Turing machine (NDTM) is identical to the standard TM except that 
its control unit has an external choice input. (See Fig. 3.23.) 

DEFINITION 3.7.3 A nondeterministic Turing machine (NDTM) is the extension of the TM
model by the addition of a choice input to its control unit. Thus an NDTM is a seven-tuple
$M = (\Sigma, \Gamma, \beta, Q, \delta, s, h)$, where $\Sigma$ is the choice input alphabet, $\Gamma$ is the tape alphabet not
containing the blank symbol $\beta$, $Q$ is the finite set of states, $s$ is the initial state, and $h \notin Q$
is the accepting halt state. A TM cannot exit from $h$. When $M$ is in state $q$ with letter $a$ under
the tape head, reading choice input $c$, its next-state function $\delta : Q \times \Sigma \times (\Gamma \cup \{\beta\}) \mapsto
((Q \cup \{h\}) \times (\Gamma \cup \{\beta\}) \times \{L, R, N\}) \cup \{\perp\}$ has value $\delta(q, c, a)$. If $\delta(q, c, a) = \perp$, there is no
successor to the current state with choice input $c$ and tape symbol $a$. If $\delta(q, c, a) = (q', a', C)$, $M$'s
control unit enters state $q'$, writes $a'$ in the cell under the head, and moves the head left (if possible),
right, or not at all if $C$ is L, R, or N, respectively. The choice input selects possible transitions on
each time step.

An NDTM $M$ reads one character of its choice input string $c \in \Sigma^*$ on each step. An
NDTM $M$ accepts string $w$ if there is some choice string $c$ such that the last state entered by $M$ is
$h$ when $M$ is started in state $s$ with $w$ placed left-adjusted on its otherwise blank tape and the tape
head at the leftmost tape cell. We assume that when $M$ halts during language acceptance it writes
the letter 1 in its first tape cell if its input string is accepted and 0 otherwise.

An NDTM $M$ accepts the language $L(M) \subseteq \Gamma^*$ consisting of those strings $w$ that it accepts.
Thus, if $w \notin L(M)$, there is no choice input for which $M$ accepts $w$.

Note that the choice input c associated with acceptance of input string w is selected with full 
knowledge of w. Also, note that an NDTM does not accept any string not in L(M); that is, 
for no choice inputs does it accept such a string. 

The NDTM simplifies the characterization of languages. It is used in Section 8.10 to 
characterize the class NP of languages accepted in nondeterministic polynomial time. 

DEFINITION 3.7.4 A language $L \subseteq \Gamma^*$ is in NP if there is a nondeterministic Turing machine
$M$ and a polynomial $p(n)$ such that $M$ accepts $L$ and for each $w \in L$ there is a choice input $c$
such that $M$ on input $w$ with this choice input halts in $p(|w|)$ steps.

A choice input is said to "verify" membership of a string in a language. The particular
string provided by the choice agent is a verifier for the language. The languages in NP are thus
easy to verify: they can be verified in a polynomial number of steps by a choice input string of
polynomial length.

Figure 3.23 A nondeterministic Turing machine modeled as a deterministic one whose control
unit has an external choice input that disambiguates the value of its next state.

The class NP contains many important problems. The Traveling Salesperson Problem
(TSP) is in this class. TSP is a set of strings of the following kind: each string contains an
integer $n$, the number of vertices (cities) in an undirected graph $G$, as well as distances between
every pair of vertices in $G$, expressed as integers, and an integer $k$ such that there is a path that
visits each city once, returning to its starting point (a tour), whose length is at most $k$. A
verifier for TSP is an ordering of the vertices such that the total distance traveled is no more
than $k$. Since there are $n!$ orderings of the $n$ vertices and $n!$ is approximately $\sqrt{2\pi n}\, n^n e^{-n}$, a
verifier can be found in a number of steps exponential in $n$; the actual verification itself can be
done in $O(n^2)$ steps. (See Problem 3.24.) NP also contains many other important languages,
in particular, languages defining important combinatorial problems.
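The verification step for TSP is simple enough to write out. The sketch below is illustrative only: the function name `verify_tour` and the representation of the instance as a distance matrix `dist` are assumptions, not part of the text. It checks that the proposed ordering visits every city exactly once and that the closed tour has length at most $k$, which is well within the $O(n^2)$ bound quoted above.

```python
def verify_tour(dist, k, order):
    """Check a proposed TSP verifier.

    dist: n x n matrix of integer distances (dist[u][v] between cities u and v)
    k: the tour-length bound
    order: a proposed ordering of the cities 0..n-1
    Returns True when order is a tour of total length at most k.
    """
    n = len(dist)
    if sorted(order) != list(range(n)):  # each city must appear exactly once
        return False
    # Sum the n legs of the tour, including the return to the starting city.
    total = sum(dist[order[i]][order[(i + 1) % n]] for i in range(n))
    return total <= k
```

Note that this only verifies a given ordering; finding one may require examining many of the $n!$ orderings.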

While it is obvious that P is a subset of NP, it is not known whether they are the same.
Since for each language $L$ in NP there is a polynomial $p$ such that for each string $w$ in $L$
there is a verifying choice input $c$ of length $p(|w|)$, a polynomial in the length of $w$, the
number of possible choice strings $c$ to be considered in search of a verifying string is at most
an exponential in $|w|$. Thus, for every language in NP there is an exponential-time algorithm
to recognize it.
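The exponential-time algorithm amounts to trying every choice string of the required length. The following sketch shows the idea under stated assumptions: `accepts_by_search` and `subset_sum_verifier` are hypothetical names, the verifier stands in for running the NDTM with a fixed choice input, and subset sum is used only as a convenient illustration of a polynomial-time verifier.

```python
from itertools import product


def accepts_by_search(verifier, w, choice_alphabet, length):
    """Deterministically decide acceptance by trying every choice string.

    verifier(w, choice) returns True when the choice string verifies w.
    The loop examines len(choice_alphabet) ** length strings in the worst case,
    which is exponential in length.
    """
    for choice in product(choice_alphabet, repeat=length):
        if verifier(w, choice):
            return True
    return False


def subset_sum_verifier(w, choice):
    """Illustrative verifier: choice bits select a subset that must sum to the target."""
    numbers, target = w
    return sum(x for x, bit in zip(numbers, choice) if bit) == target
```

With numbers `(3, 5, 7)`, the target 12 is accepted (the choice string selecting 5 and 7 verifies it), while the target 4 is rejected only after all $2^3$ choice strings have been tried.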

Despite decades of research, the question of whether P is equal to NP, denoted P $\stackrel{?}{=}$ NP,
remains open. It is one of the great outstanding questions of computer science today. The
approach taken to this question is to identify NP-complete problems (see Section 8.10), the
hardest problems in NP, and then attempt to determine whether or not such problems
are in P. TSP is one of these NP-complete problems.



3.8 Universality of the Turing Machine 



We show the existence of a universal Turing machine in two senses. On the one hand, we show
that there is a Turing machine that can simulate any RAM computation. Since every Turing
machine can be simulated by the RAM, the Turing machine simulating a RAM is universal for
the set of all Turing machines.

Also, because there is a Turing machine that can simulate any RAM computation, every 
RAM program can be simulated on this Turing machine. Since it is not hard to see that every 
Turing machine can be described by a RAM program (see Problem 3.29), it follows that the 
RAM programs are exactly the programs computed by Turing machines. Consequently, the 
RAM is also universal. 

The following theorem demonstrates that RAM computations can be simulated by Turing- 
machine computations and vice versa when each operates with bounded memory. Note that 
all halting computations are bounded-memory computations. A direct proof of the existence 
of a universal Turing machine is given in Section 5.5. 

THEOREM 3.8.1 Let $S = mb$ and $m \ge b$. Then for every $m$-word, $b$-bit Turing machine $M_{TM}$
(with storage capacity $S$) there is an $O(m)$-word, $b$-bit RAM that simulates a time $T$ computation
of $M_{TM}$ in time $O(T)$ and storage $O(S)$. Similarly, for every $m$-word, $b$-bit RAM $M_{RAM}$
there is an $O((m/b) \log m)$-word, $O(b)$-bit Turing machine that simulates a $T$-time, $S$-storage
computation of $M_{RAM}$ in time $O(ST \log S)$ and storage $O(S \log S)$.

Proof We begin by describing a RAM that simulates a TM. Consider a $b$-bit RAM program
to simulate an $m$-word, $b$-bit TM. As shown in Theorem 3.4.1, a RAM program can be
written to simulate one step of an FSM. Since a TM control unit is an FSM, it suffices to
exhibit a RAM program to simulate a tape unit (also an FSM); this is straightforward, as
is combining the two programs. If the RAM has storage capacity proportional to that of
the TM, then the RAM need only record with one additional word the position of the tape
head. This word, which can be held in a RAM register, is incremented or decremented as
the head moves. The resulting program runs in time proportional to the running time of
the TM.

We now describe a $b^*$-bit TM that simulates a RAM, where $b^* = \lceil \log m \rceil + b + c$ for
some constant $c$, an assumption we examine later. Let RAM words and their corresponding
addresses be placed in individual cells on the tape of the TM, as suggested in Fig. 3.24. Let
the address addr of the RAM CPU program counter be placed on the tape of the TM to the
left, as suggested by the shading in the figure. (It is usually assumed that, unlike the RAM,
the TM holds words of size no larger than $O(b)$ in its control unit.) The TM simulates
a RAM by simulating the RAM fetch-and-execute cycle. This means it fetches a word at
address addr in the simulated RAM memory unit, interprets it as an instruction, and then
executes the instruction (which might require a few additional accesses to the memory unit
to read or write data). We return to the simulation of the RAM CPU after we examine the
simulation of the RAM memory unit.

Figure 3.24 Organization of a tape unit to simulate a RAM. Each RAM memory word $w_j$ is
accompanied by its address $j$ in binary.

The TM can find a word at location addr as follows. It reads the most significant bit
of addr and moves right on its tape until it finds the first word with this most significant
bit. It leaves a marker at this location. (The symbol ◊ in Fig. 3.24 identifies the first place
a marker is left.) It then returns to the left-hand end of the tape and obtains the next most
significant bit of addr. It moves back to the marker ◊ and then carries this marker forward
to the next address containing the next most significant bit (identified by the marker A in
Fig. 3.24). This process is repeated until all bits of addr have been visited, at which point
the word at location addr in the simulated RAM is found. Since $m$ tape unit cells are used
in this simulation, at most $O(m \log m)$ TM steps are taken for this purpose.
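The marker-carrying search can be mimicked in a few lines to see where the $O(m \log m)$ bound comes from. This is a rough sketch with a simplified cost model, not a faithful TM: the name `find_word` is hypothetical, the tape is a Python list of `(address, word)` pairs in increasing address order, and each head movement of one cell is charged one step.

```python
def find_word(cells, addr, nbits):
    """Locate the word stored with address addr by matching one address bit at a time.

    cells: list of (address, word) pairs for addresses 0 .. 2**nbits - 1, in order
    Returns (word, steps), where steps counts simulated head movements.
    """
    marker = 0  # position of the marker left on the tape
    steps = 0
    for i in range(nbits):
        shift = nbits - 1 - i            # most significant bit first
        bit = (addr >> shift) & 1
        steps += marker                  # walk from the left end to the marker
        # Carry the marker forward to the first cell whose address has this bit.
        while (cells[marker][0] >> shift) & 1 != bit:
            marker += 1
            steps += 1
        steps += marker                  # return to the left end for the next bit
    return cells[marker][1], steps
```

Each of the `nbits` $= \lceil \log m \rceil$ rounds costs at most two passes over the tape, and the marker advances at most $m$ cells in total, so the step count is at most $2m\lceil \log m \rceil + m$.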

The TM must also simulate internal RAM CPU computations. Each addition,
subtraction, and comparison of $b$-bit words can be done by the TM control unit in a constant
number of steps, as can the logical vector operations. (For simplicity, we assume that the
RAM does not use its I/O registers. To simulate these operations, either other tapes would
be used or space would be reserved on the single tape to hold input and output words.) The
jump instructions as well as the incrementing of the program counter require moving and
incrementing $\lceil \log m \rceil$-bit addresses. These cannot be simulated by the TM control unit
in a constant number of steps since it can only operate on $b$-bit words. Instead, they are
simulated on the tape by moving addresses in $b$-bit blocks. If two tape cells are separated
by $q - 1$ cells, $2q$ steps are necessary to move each block of $b$ bits from the first cell to the
second. Thus, a full address can be moved in $2q \lceil \lceil \log m \rceil / b \rceil$ steps. An address can also
be incremented using ripple addition in $\lceil \lceil \log m \rceil / b \rceil$ steps using operations on $b$-bit words,
since the blocks of an address are contiguous. (See Section 2.7 for a discussion of ripple
addition.) Thus, both of these address-manipulation operations can be done in at most
$O(m \lceil \lceil \log m \rceil / b \rceil)$ steps, since no two words are separated by more than $O(m)$ cells.

Now consider the general case of a TM with word size comparable to that of the RAM,
that is, a size too small to hold an address as well as a word. In particular, consider a TM with
$\hat{b}$-bit tape alphabet where $\hat{b} = cb$, $c > 1$ a constant. In this case, we divide addresses into
$\lceil \lceil \log m \rceil / b \rceil$ $b$-bit words and place these words in locations that precede the value of the
RAM word at this address, as suggested in Fig. 3.40. We also place the address addr at the
beginning of the tape in the same number of tape words. A total of $O((m/b)(\log m))$ $O(b)$-bit
words are used to store all this data. Now assume that the TM can carry the contents of
a $\hat{b}$-bit word in its control unit. Then, as shown in Problem 3.26, the extra symbols in the
TM's tape alphabet can be used as markers to find a word with a given address in at most
$O((m/b)(\log m))$ TM steps using storage $O((m/b) \log m)$. Hence each RAM memory
access translates into $O((m/b)(\log m))$ TM steps on this machine.

Simulation of the CPU on this machine is straightforward. Again, each addition,
subtraction, comparison, and logical vector operation on $b$-bit words can be done in a constant
number of steps. Incrementing of the program counter can also be done in $\lceil \lceil \log m \rceil / b \rceil$
operations since the cells containing this address are contiguous. However, since a jump
operation may require moving an address by $O(m)$ cells in the $b^*$-bit TM, it may now require
moving it by $O(m(\log m)/b)$ cells in the $\hat{b}$-bit TM in $O(m((\log m)/b)^2)$ steps.
Combining these results, we see that each step of the RAM may require as many as
$O(m((\log m)/b)^2)$ steps of the $\hat{b}$-bit TM. This machine uses storage $O((m/b) \log m)$.
Since $m = S/b$, the conclusion of the theorem follows. ■

This simulation of a bounded-memory RAM by a Turing machine assumes that the RAM 
has a fixed number of memory words. Although this may appear to prevent an unbounded- 
memory TM from simulating an unbounded-memory RAM, this is not the case. If the Turing 
machine detects that an address contains more than the number of bits currently assumed 
as the maximum number, it can increase by 1 the number of bits allocated to each memory 
location and then resume computation. To make this adjustment, it will have to space out the 
memory words and addresses to make space for the extra bits. (See Problem 3.28.) 

Because a Turing machine with no limit on the length of its tape can be simulated by a 
RAM, this last observation demonstrates the existence of universal Turing machines, Tur- 
ing machines with unbounded memory (but with fixed-size control units and bounded-size 
tape alphabets) that can simulate arbitrary Turing machines. This matter is also treated in 
Section 5.5. 

Since the RAM can execute RAM programs, the same is true of the Turing machines. As 
mentioned above, it is not hard to see that every Turing machine can be simulated by a RAM 
program. (See Problem 3.29.) As a consequence, the RAM programs are exactly the programs 
that can be computed by a Turing machine. 

While the above remarks apply to the one-tape Turing machine, they also apply to all other 
Turing machine models, such as double-ended and multi-tape Turing machines, because each 
of these can also be simulated by the one-tape Turing machine. (See Section 5.2.) 



3.9 Turing Machine Circuit Simulations 



Just as every T-step finite-state machine computation can be simulated by a circuit, so can 
every T-step Turing machine computation. We give two circuit simulations, a simple one that 
demonstrates the concept and another more complex one that yields a smaller circuit. We use 
these two simulations in Sections 3.9.5 and 3.9.6 to establish computational inequalities that 
must hold for Turing machines. With a different interpretation they provide examples of P- 
complete and NP-complete problems. (See also Sections 8.9 and 8.10.) These results illustrate 
the central role of circuits in theoretical computer science. 

3.9.1 A Simple Circuit Simulation of TM Computations 

We now design a circuit simulating a computation of a Turing machine M that uses m memory 
cells and T steps. Since the only difference between a deterministic and nondeterministic 
Turing machine is the addition of a choice input to the control unit, we design a circuit for a 
nondeterministic Turing machine. 

For deterministic computations, the circuit simulation provides computational inequalities 
that must be satisfied by computational resources, such as space and time, if a problem is to be 
solved by M. Such an inequality is stated at the end of this section. 

With the proper interpretation, the circuit simulation of a deterministic computation is an 
instance of a P-complete problem, one of the hardest problems in P to parallelize. Here P is 
the class of polynomial-time languages. A first P-complete problem is stated in the following 
section. This topic is studied in detail in Section 8.9. 
For nondeterministic computations, the circuit simulation produces an instance of an
NP-complete problem, a hardest problem to solve in NP. Here NP is the class of languages accepted
in polynomial time by a nondeterministic Turing machine. A first NP-complete problem is
stated in the following section. This topic is studied in detail in Section 8.10.

THEOREM 3.9.1 Any computation performed by a one-tape Turing machine $M$, deterministic or
nondeterministic, on an input string $w$ in $T$ steps using $m$ $b$-bit memory cells can be simulated
by a circuit $C_{M,T}$ over the standard complete basis $\Omega$ of size and depth $O(ST)$ and $O(T \log S)$,
respectively, where $S = mb$ is the storage capacity in bits of $M$'s tape. For the deterministic TM
the inputs to this circuit consist of the values of $w$. For the nondeterministic TM the inputs consist
of $w$ and the Boolean choice input variables whose values are not set in advance.

Proof To construct a circuit $C_{M,T}$ simulating $T$ steps by $M$ is straightforward because $M$
is a finite-state machine now that its storage capacity is limited. We need only extend the
construction of Section 3.1.1 and construct a circuit for the next-state and output functions
of $M$. As shown in Fig. 3.25, it is convenient to view $M$ as a pair of synchronous FSMs
(see Section 3.1.4) and design separate circuits for $M$'s control and tape units. The design
of the circuit for the control unit is straightforward since it is an unspecified NFSM. The
tape circuit, which realizes the next-state and output functions for the tape unit, contains
$m$ cell circuits, one for each cell on the tape. We denote by $C_t(m)$, $1 \le t \le T$, the $t$th tape
circuit. We begin by constructing a tape circuit and determining its size and depth.

Figure 3.25 The circuit $C_{M,T}$ simulates an $m$-cell, $T$-step computation by a nondeterministic
Turing machine $M$. It contains $T$ copies of $M$'s control unit circuit and $T$ column circuits, $C_t$,
each containing cell circuits $C_{j,t}$, $0 \le j \le m-1$, $1 \le t \le T$, simulating the $j$th tape cell on the
$t$th time step. $q_t$ and $c_t$ are $M$'s state on the $t$th step and its $t$th set of choice variables. Also, $a_{j,t}$
is the value in the $j$th cell on the $t$th step, $s_{j,t}$ is 1 if the head is over cell $j$ at the $t$th time step, and
$v_{j,t}$ is $a_{j,t}$ if $s_{j,t} = 1$ and 0 otherwise. $v_t$, the vector OR of $v_{j,t}$, $0 \le j \le m-1$, supplies the
value under the head to the control unit, which computes head movement commands, $h_t$, and
a new word, $w_t$, for the current cell in the next simulated time step. The value of the function
computed by $M$ resides on its tape after the $T$th step.

For $0 \le j \le m-1$ and $1 \le t \le T$ let $C_{j,t}$ be the $j$th cell circuit of the $t$th tape circuit,
$C_t(m)$. $C_{j,t}$ produces the value $a_{j,t}$ contained in the $j$th cell after the $t$th step as well as
$s_{j,t}$, whose value is 1 if the head is over the $j$th tape cell after the $t$th step and 0 otherwise.
The value of $a_{j,t}$ is either $a_{j,t-1}$ if $s_{j,t-1} = 0$ (the head is not over this cell) or $w$ if $s_{j,t-1} = 1$
(the head is over the cell). Subcircuit $SC_2$ of Fig. 3.26 performs this computation.

Subcircuit $SC_1$ in Fig. 3.26 computes $s_{j,t}$ from $s_{j-1,t-1}$, $s_{j,t-1}$, $s_{j+1,t-1}$ and the triple
$h_t = (h_t^{-1}, h_t^{0}, h_t^{+1})$, where $h_t^{-1} = 1$ if the head moves to the next lower-numbered cell,
$h_t^{+1} = 1$ if it moves to the next higher-numbered cell, or $h_t^{0} = 1$ if it does not move. Thus,
$s_{j,t} = 1$ if $s_{j+1,t-1} = 1$ and $h_t^{-1} = 1$, or if $s_{j-1,t-1} = 1$ and $h_t^{+1} = 1$, or if $s_{j,t-1} = 1$ and
$h_t^{0} = 1$. Otherwise, $s_{j,t} = 0$.
Subcircuit $SC_3$ of cell circuit $C_{j,t}$ generates the $b$-bit word $v_{j,t}$ that is used to provide
the value under the head on the $t$th step. $v_{j,t}$ is $a_{j,t}$ if the head is over the $j$th cell on the
$t$th step ($s_{j,t} = 1$) and 0 otherwise. The vector-OR of $v_{j,t}$, $0 \le j \le m-1$, is formed using
the tree circuit shown in Fig. 3.25 to compute the value of the $b$-bit word $v_t$ under the head
after the $t$th step. (This can be done by $b$ balanced binary OR trees, each with size $m - 1$
and depth $\lceil \log_2 m \rceil$.) $v_t$ is supplied to the $t$th copy of the control unit circuit, which also
uses the previous state of the control unit, $q_t$, and the choice input $c_t$ (a tuple of Boolean
variables) to compute the next state, $q_{t+1}$, the new $b$-bit word $w_{t+1}$ for the current tape cell,
and the head movement command $h_{t+1}$.

Figure 3.26 The cell circuit $C_{j,t}$ has three components: $SC_1$, a circuit to compute the new
value for the head location bit $s_{j,t}$ from the values of this quantity on the preceding step at
neighboring cells and the head movement vector $h_t$, $SC_2$, a circuit to replace the value in the $j$th
cell on the $t$th step with the input $w$ if the head is over the cell on the $(t-1)$st step ($s_{j,t-1} = 1$),
and $SC_3$, a circuit to produce the new value in the $j$th cell at the $t$th step if the head is over this
cell ($s_{j,t} = 1$) and the zero vector otherwise. The circuit $C_{j,t}$ has $5(b + 1)$ gates and depth 4.

Summarizing, it follows that the tth tape circuit, $C_t(m)$, uses $O(S)$ gates (here $S = mb$)
and has depth $O(\log S/b)$.

Let $C_{\mathrm{control}}$ and $D_{\mathrm{control}}$ be the size and depth of the circuit simulating the control
unit. It follows that the circuit simulating T computation steps by a Turing machine M has
$T\,C_{\mathrm{control}}$ gates in the T copies of the control unit and $O(ST)$ gates in the tape circuits for a
total of $O(ST)$ gates. Since the longest path through the circuit of Fig. 3.26 passes through
each control and tape circuit, the depth of this circuit is $O(T(D_{\mathrm{control}} + \log S/b)) = O(T \log S)$.

The simulation of M is completed by placing the head over the zeroth cell by letting
$s_{0,0} = 1$ and $s_{j,0} = 0$ for $j \ne 0$. The inputs to M are fixed by setting $a_{j,0} = w_j$ for
$0 \le j \le n - 1$ and to the blank symbol for $j \ge n$. Finally, $v_0$ is set equal to $a_{0,0}$, the
value under the head at the start of the computation. The choice inputs are sets of Boolean
variables under the control of an outside agent and are treated as variables of the circuit
simulating the Turing machine M. ■

We now give two interpretations of the above simulation. The first establishes that the 
circuit complexity for a function provides a lower bound to the time required by a computation 
on a Turing machine. The second provides instances of problems that are P-complete and NP- 
complete. 



3.9.2 Computational Inequalities for Turing Machines 

When the simulation of Theorem 3.9.1 is specialized to a deterministic Turing machine M, a
circuit is constructed that computes the function f computed by M in T steps with S bits of
memory. It follows that $C_\Omega(f)$ and $D_\Omega(f)$ cannot be larger than those given in this theorem,
since this circuit also computes f. From this observation we have the following computational
inequalities.

THEOREM 3.9.2 The function f computed by an m-word, b-bit one-tape Turing machine in T
steps can also be computed by a circuit whose size and depth satisfy the following bounds over any
complete basis $\Omega$, where $S = mb$ is the storage capacity used by this machine:

$C_\Omega(f) = O(ST)$
$D_\Omega(f) = O(T \log S)$

Since $S = O(T)$ (at most $T + 1$ cells can be visited in T steps), we have the following
corollary. It demonstrates that the time T to compute a function f with a Turing machine is
at least the square root of its circuit size. As a consequence, circuit size complexity can be used
to derive lower bounds on computation time on Turing machines.

COROLLARY 3.9.1 Let the function f be computed by an m-word, b-bit one-tape Turing machine
in T steps, b fixed. Then, over any complete basis $\Omega$ the following inequality must hold:

$C_\Omega(f) = O(T^2)$
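
The step from Theorem 3.9.2 to the corollary can be written out explicitly. In the sketch below, $\kappa$ is a hypothetical name for the constant hidden in the $O(ST)$ bound; it does not appear in the theorem itself.

$$
S = mb \le (T+1)\,b,
\qquad
C_\Omega(f) \le \kappa\, S\, T \le \kappa\, b\, T (T+1) = O(T^2) \quad \text{for fixed } b.
$$

Read in the other direction, $T = \Omega\bigl(\sqrt{C_\Omega(f)}\bigr)$ when b is fixed, which is the square-root lower bound on time.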

There is no loss in assuming that a language L is a set of strings over a binary alphabet;
that is, $L \subseteq \mathcal{B}^*$. As explained in Section 1.2.3, a language can be defined by a family
$\{f_1, f_2, f_3, \ldots\}$ of characteristic (Boolean) functions, $f_n : \mathcal{B}^n \mapsto \mathcal{B}$, where a string w of
length n is in L if and only if $f_n(w) = 1$.

Theorem 3.9.2 not only establishes a clear connection between Turing time complexity
and circuit size complexity, but it also provides a potential means to resolve the question
$\mathbf{P} \stackrel{?}{=} \mathbf{NP}$ of whether P and NP are equal or not. Circuit complexity is currently believed
to be the most promising tool to examine this question. (See Chapter 9.)

3.9.3 Reductions from Turing to Circuit Computations 

As shown in Theorem 3.9.1, a circuit $C_{M,T}$ can be constructed that simulates a time- and
space-bounded computation by either a deterministic or a nondeterministic Turing machine
M. If M is deterministic and accepts the binary input string w, then $C_{M,T}$ has value 1 when
supplied with the value of w. If M is nondeterministic and accepts the binary input string w,
then for some values of the binary choice variables c, $C_{M,T}$ on inputs w and c has value 1.

The language of strings describing circuits with fixed inputs whose value on these inputs 
is 1 is called CIRCUIT VALUE. When the circuits also have variable inputs whose values can
be chosen so that the circuits have value 1, the language of strings describing such circuits is
called CIRCUIT SAT. (See Section 3.9.6.) The languages CIRCUIT VALUE and CIRCUIT SAT 
are examples of P-complete and NP-complete languages, respectively. 

The P-complete and NP-complete languages play an important role in complexity the- 
ory: they are prototypical hard languages. The P-complete languages can all be recognized in 
polynomial time on serial machines, but it is not known how to recognize them on parallel 
machines in time that is a polynomial in the logarithm of the length of strings (this is called 
poly-logarithmic time), which should be possible if they are parallelizable. The NP-complete 
languages can be recognized in exponential time on deterministic serial machines, but it is 
not known how to recognize them in polynomial time on such machines. Many important 
problems have been shown to be P-complete or NP-complete. 

Because so much effort has been expended without success in trying to show that the
NP-complete (P-complete) languages can be solved serially (in parallel) in polynomial
(poly-logarithmic) time, it is generally believed they cannot. Thus, showing that a problem is
NP-complete (P-complete) is considered good evidence that a problem is hard to solve serially
(in parallel).

To obtain such results, we exhibit a program that writes the description of the circuit $C_{M,T}$
from a description of the TM M and the values written initially on its tape. The time and
space needed by this program are used to classify languages and, in particular, to identify the
P-complete and NP-complete languages.

The simple program $\mathcal{P}$ shown schematically in Fig. 3.27 writes a description of the circuit
$C_{M,T}$ of Fig. 3.25, which is deterministic or nondeterministic depending on the nature of
M. (Textual descriptions of circuits are given in Section 2.2. Also see Problem 3.8.) The
first loop of this program reads the value of the ith input letter $w_i$ of the string w written on
the input tape of $\mathcal{P}$, after which it writes a fragment of a straight-line program containing the
value of $w_i$. The second loop sets the remaining initial values of cells to the blank symbol $\beta$.
The third outer loop writes a straight-line program for the control unit using the procedure
WRITE_CONTROL_UNIT that has as arguments t, the index of the current time step, and $c_t$,
the tuple of Boolean choice input variables for the tth step. These choice variables are not used
if M is deterministic. In addition, this loop uses the procedure WRITE_OR to write a straight-line
program for the vector OR circuit that forms the contents $v_t$ of the cell under the head
after the tth step. Its inner loop uses the procedure WRITE_CELL_CIRCUIT with parameters j
and t to write a straight-line program for the jth cell circuit in the tth tape.

    for i := 0 to n - 1
        READ_VALUE(w_i)
        WRITE_INPUT(i, w_i)
    for j := n to m - 1
        WRITE_INPUT(j, β)
    for t := 1 to T
        WRITE_CONTROL_UNIT(t, c_t)
        WRITE_OR(t, m)
        for j := 0 to m - 1
            WRITE_CELL_CIRCUIT(j, t)

Figure 3.27 A program $\mathcal{P}$ to write the description of a circuit $C_{M,T}$ that simulates T steps of a
nondeterministic Turing machine M and uses m memory words. It reads the n inputs supplied
to M, after which it writes the input steps of a straight-line program that reads these n inputs as
well as $m - n$ blanks $\beta$ into the first copy of a tape unit. It then writes the remaining steps of a
straight-line program consisting of descriptions of the T copies of the control unit and the mT
cell circuits simulating the T copies of the tape unit.
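
The loop structure of Fig. 3.27 can be made concrete with a short Python generator. This is a sketch under stated assumptions: the textual line formats ("INPUT", "CONTROL_UNIT", and so on) and the function name are hypothetical stand-ins for the WRITE_ procedures, whose actual output is a straight-line program fragment that we do not reproduce here.

```python
def write_circuit_description(w, m, T, blank="beta"):
    """Yield one placeholder line per WRITE_ call of the program in Fig. 3.27.

    w: input string of length n, m: number of tape cells, T: number of steps.
    The line formats are hypothetical; each line stands in for the
    straight-line program fragment that the real procedure would emit.
    Only the loop counters i, j, and t are kept as working state.
    """
    n = len(w)
    for i in range(n):                       # first loop: copy the inputs
        yield f"INPUT cell={i} value={w[i]}"
    for j in range(n, m):                    # second loop: blank the rest
        yield f"INPUT cell={j} value={blank}"
    for t in range(1, T + 1):                # third loop: one tape copy per step
        yield f"CONTROL_UNIT step={t} choice=c_{t}"
        yield f"OR step={t} fan_in={m}"
        for j in range(m):
            yield f"CELL_CIRCUIT cell={j} step={t}"
```

Writing the output as a stream rather than building it in memory mirrors the point made next in the text: apart from its output, the program needs only a few counters whose values are bounded by m and T.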

The program $\mathcal{P}$ given in Fig. 3.27 is economical in its use of space and time, as we show.
Consider a language L in P; that is, for L there is a deterministic Turing machine $M_L$ and a
polynomial $p(n)$ such that on an input string w of length n, $M_L$ halts in $T = p(n)$ steps.
It accepts w if it is in L and rejects it otherwise. Since $\mathcal{P}$ uses space logarithmic in the values
of n and T and $T = p(n)$, $\mathcal{P}$ uses space logarithmic in n. (For example, if $p(n) = n^6$,
$\log_2 p(n) = 6 \log_2 n = O(\log n)$.) Such programs are called log-space programs.

We show in Theorem 8.8.1 that the composition of two log-space programs is a log-space 
program, a non-obvious result. However, it is straightforward to show that the composition of 
two polynomial-time programs is a polynomial-time program. (See Problems 3.2 and 8.19.) 
Since the inner and outer loops of $\mathcal{P}$ each execute a polynomial number of steps, it follows that $\mathcal{P}$
is a polynomial-time program.

If M is nondeterministic, $\mathcal{P}$ continues to be a log-space, polynomial-time program. The
only difference is that it writes a circuit description containing references to choice variables
whose values are not specified in advance. We state these observations in the form of a theorem.



THEOREM 3.9.3 Let $L \in$ P ($L \in$ NP). Then for each string $w \in \mathcal{B}^*$ a deterministic
(nondeterministic) circuit $C_{M,T}$ can be constructed by a program in logarithmic space and polynomial time
in $n = |w|$, the length of w, such that the output of $C_{M,T}$, the value in the first tape cell, is (can
be) assigned value 1 (for some values of the choice inputs) if $w \in L$ and 0 if $w \notin L$.

The program of Fig. 3.27 provides a translation (or reduction) from any language in NP
(or P) to a language that we later show is a hardest language in NP (or P).

We now use Theorem 3.9.3 and the above facts to give a brief introduction to the P- 
complete and NP-complete languages, which are discussed in more detail in Chapter 8. 

3.9.4 Definitions of P-Complete and NP-Complete Languages 

In this section we identify languages that are hardest in the classes P and NP. A language $L_0$ is
hardest in one of these classes if a) $L_0$ is itself in the class and b) for every language L in the
class, a test for the membership of a string w in L can be constructed by translating w with an
algorithm to a string v and testing for membership of v in $L_0$. If the class is P, the algorithm
must use at most space logarithmic in the length of w, whereas in the case of NP, the algorithm
must use time at most a polynomial in the length of w. Such a language $L_0$ is said to be a
complete language for this complexity class. We begin by defining the P-complete languages.

DEFINITION 3.9.1 A language $L \subseteq \mathcal{B}^*$ is P-complete if it is in P and if for every language
$L_0 \subseteq \mathcal{B}^*$ in P, there is a log-space deterministic program that translates each $w \in \mathcal{B}^*$ into a string
$w' \in \mathcal{B}^*$ such that $w \in L_0$ if and only if $w' \in L$.

The NP-complete languages have a similar definition. However, instead of requiring that 
the translation be log-space, we ask only that it be polynomial-time. It is not known whether 
all polynomial-time computations can be done in logarithmic space. 

DEFINITION 3.9.2 A language $L \subseteq \mathcal{B}^*$ is NP-complete if it is in NP and if for every language
$L_0 \subseteq \mathcal{B}^*$ in NP, there is a polynomial-time deterministic program that translates each $w \in \mathcal{B}^*$
into a string $w' \in \mathcal{B}^*$ such that $w \in L_0$ if and only if $w' \in L$.

Space precludes our explaining the important role of the P-complete languages. We simply 
report that these languages are the hardest languages to parallelize and refer the reader to Sec- 
tions 8.9 and 8.14.2. However, we do explain the importance of the NP-complete languages. 

As the following theorem states, if an NP-complete language is in P; that is, if membership 
of a string in an NP-complete language can be determined in polynomial time, then the same 
can be done for every language in NP; that is, P and NP are the same class of languages. 
Since decades of research have failed to show that P = NP, a determination that a problem is 
NP-complete is a testimonial to but not a proof of its difficulty. 

THEOREM 3.9.4 If an NP-complete language is in P, then P = NP. 

Proof Let L be NP-complete and let $L_0$ be an arbitrary language in NP. Because L is
NP-complete, there is a polynomial-time program that translates an arbitrary string w into a
string $w'$ such that $w' \in L$ if and only if $w \in L_0$. If $L \in$ P, then testing of membership
of strings in $L_0$ can be done in polynomial time in the length of the string. It follows that
there exists a polynomial-time program to determine membership of a string in $L_0$. Thus,
every language in NP is also in P. ■

3.9.5 Reductions to P-Complete Languages 

We now formally define CIRCUIT VALUE, our first P-complete language.

CIRCUIT VALUE

Instance: A circuit description with fixed values for its input variables and a designated
output gate.

Answer: "Yes" if the output of the circuit has value 1.

THEOREM 3.9.5 The language CIRCUIT VALUE is P-complete.

Proof To show that CIRCUIT VALUE is P-complete, we must show that it is in P and 
that every language in P can be translated to it by a log-space program. We have already 
shown the second half of the proof in Theorem 3.9.1. We need only show the first half, 
which follows from a simple analysis of the obvious program. Since a circuit is a graph of a 
straight-line program, each step depends on steps that precede it. (Such a program can be 
produced by a pre-order traversal of the circuit starting with its output vertex.) Now scan 
the straight-line program and evaluate and store in an array the value of each step. Successive 
steps access this array to find their arguments. Thus, one pass over the straight-line program 
suffices to evaluate it; the evaluating program runs in linear time in the length of the circuit 
description. Hence CIRCUIT VALUE is in P. ■ 
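
The one-pass evaluation used in this proof can be sketched in Python. The step format below (tuples such as ("AND", j, k) that refer to earlier steps by index) is an assumption made for illustration; it is not the textual circuit format of Section 2.2.

```python
def circuit_value(steps, output):
    """Evaluate a straight-line program in a single left-to-right pass.

    steps: list of tuples in a hypothetical format:
        ("CONST", v)   a fixed input with value v in {0, 1}
        ("NOT", j)     complement of step j
        ("AND", j, k)  conjunction of steps j and k
        ("OR", j, k)   disjunction of steps j and k
    Every step may refer only to steps that precede it.
    output: index of the designated output step.
    """
    values = []                      # values[i] holds the result of step i
    for step in steps:
        op = step[0]
        if op == "CONST":
            values.append(step[1])
        elif op == "NOT":
            values.append(1 - values[step[1]])
        elif op == "AND":
            values.append(values[step[1]] & values[step[2]])
        elif op == "OR":
            values.append(values[step[1]] | values[step[2]])
        else:
            raise ValueError(f"unknown step type: {op}")
    return values[output]
```

Each step is visited once and looks up at most two earlier entries of the array, so the running time is linear in the number of steps when array access is taken to cost one unit.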

When we wish to show that a new language $L_1$ is P-complete, we first show that it is in
P. Then we show that every language $L \in$ P can be translated to it in logarithmic space; that
is, for each string w, there is an algorithm that uses temporary space $O(\log |w|)$ (as does the
program in Fig. 3.27) that translates w into a string v such that w is in L if and only if v is
in $L_1$. (This is called a log-space reduction. See Section 8.5 for a discussion of temporary
space.)

If we have already shown that a language $L_0$ is P-complete, we ask whether we can save
work by using this fact to show that another language, $L_1$, in P is P-complete. This is possible
because the composition of two deterministic log-space algorithms is another log-space
algorithm, as shown in Theorem 8.8.1. Thus, if we can translate $L_0$ into $L_1$ with a log-space
algorithm, then every language in P can be translated into $L_1$ by a log-space reduction. (This
idea is suggested in Fig. 3.28.) Hence, the task of showing $L_1$ to be P-complete is reduced
to showing that $L_1$ is in P and that $L_0$, which is P-complete, can be translated to $L_1$ by a
log-space algorithm. Many P-complete languages are exhibited in Section 8.9.



[Diagram: arrows between languages labeled "log-space reduction by Def. 3.9.1" and "log-space reduction".]

Figure 3.28 A language $L_0$ is shown P-complete by demonstrating that $L_0$ is in P and that
every language L in P can be translated to it in logarithmic space. A new language $L_1$ is shown
P-complete by showing that it is in P and that $L_0$ can be translated to it in log-space. Since L can
be $L_1$, $L_1$ can also be translated to $L_0$ in log-space.

3.9.6 Reductions to NP-Complete Languages

Our first NP-complete language is CIRCUIT SAT, a language closely related to CIRCUIT
VALUE.

CIRCUIT SAT

Instance: A circuit description with n input variables $\{x_1, x_2, \ldots, x_n\}$ for some integer n
and a designated output gate.

Answer: "Yes" if there is an assignment of values to the variables such that the output of the
circuit has value 1.

THEOREM 3.9.6 The language CIRCUIT SAT is NP-complete.

Proof To show that CIRCUIT SAT is NP-complete, we must show that it is in NP and that 
every language in NP can be translated to it by a polynomial-time program. We have already 
shown the second half of the proof in Theorem 3.9.1. We need only show the first half. As 
discussed in the proof of Theorem 3.9.5, each circuit can be organized so that all steps on 
which a given step depends precede it. We assume that a string in CIRCUIT SAT meets 
this condition. Design an NTM which on such a string uses choice inputs to assign values 
to each of the variables in the string. Then invoke the program described in the proof of 
Theorem 3.9.5 to evaluate the circuit. For some assignment to the variables $x_1, x_2, \ldots, x_n$,
this nondeterministic program can accept each string in CIRCUIT SAT but no string not in
CIRCUIT SAT. It follows that CIRCUIT SAT is in NP. ■
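
The guess-then-evaluate structure of this proof can be imitated deterministically by trying every assignment in turn, at exponential cost. The Python sketch below does exactly that; the step format and function names are hypothetical, and the exhaustive loop stands in for the choice inputs of the NTM rather than reproducing them.

```python
from itertools import product


def evaluate(steps, output, assignment):
    """Evaluate a circuit once, given values for its variable inputs.

    Hypothetical step format: ("VAR", i) reads variable x_i from
    `assignment`; ("NOT", j), ("AND", j, k), ("OR", j, k) refer to
    earlier steps by index.
    """
    values = []
    for step in steps:
        op = step[0]
        if op == "VAR":
            values.append(assignment[step[1]])
        elif op == "NOT":
            values.append(1 - values[step[1]])
        elif op == "AND":
            values.append(values[step[1]] & values[step[2]])
        elif op == "OR":
            values.append(values[step[1]] | values[step[2]])
        else:
            raise ValueError(f"unknown step type: {op}")
    return values[output]


def circuit_sat(steps, output, n):
    """Return a satisfying assignment for the n variables, or None.

    Each candidate assignment plays the role of one sequence of choice
    inputs; checking a single candidate takes one pass over the circuit.
    """
    for assignment in product((0, 1), repeat=n):
        if evaluate(steps, output, assignment) == 1:
            return assignment
    return None
```

Verifying one assignment is cheap; it is the search over all $2^n$ assignments that no known deterministic method avoids in general.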

The model used to show that a language is P-complete directly parallels the model used to
show that a language $L_1$ is NP-complete. We first show that $L_1$ is in NP and then show that
every language $L \in$ NP can be translated to it in polynomial time. That is, we show that there
is a polynomial p and algorithm that on inputs of length n runs in time $p(n)$, and that for
each string w the algorithm translates w into a string v such that w is in L if and only if v is
in $L_1$. (This is called a polynomial-time reduction.) Since any algorithm that uses log-space
(as does the program in Fig. 3.27) runs in polynomial time (see Theorem 8.5.8), a log-space
reduction can be used in lieu of a polynomial-time reduction.

If we have already shown that a language L is NP-complete, we can show that another
language, $L_1$, in NP is NP-complete by translating L into $L_1$ with a polynomial-time
algorithm. Since the composition of two polynomial-time algorithms is another polynomial-time
algorithm (see Problem 3.2), every language in NP can be translated in polynomial time into
$L_1$ and $L_1$ is NP-complete. The diagram shown in Fig. 3.28 applies when the reductions
are polynomial-time and the languages are members of NP instead of P. Many NP-complete
languages are exhibited in Section 8.10.

We apply this idea to show that SATISFIABILITY is NP-complete. Strings in this language 
consist of strings representing the POSE (product-of-sums expansion) of a Boolean function. 
Thus, they consist of clauses containing literals (a variable or its negation) with the property 
that for some value of the variables at least one literal in each clause is satisfied. 

SATISFIABILITY

Instance: A set of literals $X = \{x_1, \bar{x}_1, x_2, \bar{x}_2, \ldots, x_n, \bar{x}_n\}$ and a sequence of clauses
$C = (c_1, c_2, \ldots, c_m)$ where each clause $c_i$ is a subset of X.

Answer: "Yes" if there is a (satisfying) assignment of values for the variables $\{x_1, x_2, \ldots, x_n\}$
over the set $\mathcal{B}$ such that each clause has at least one literal whose value is 1.

THEOREM 3.9.7 SATISFIABILITY is NP-complete.

Proof SATISFIABILITY is in NP because for each string w in this language there is a sat- 
isfying assignment for its variables that can be verified by a polynomial-time program. We 
sketch a deterministic RAM program for this purpose. This program reads as many choice 
variables as there are variables in w and stores them in memory locations. It then evalu- 
ates each literal in each clause in w and declares this string satisfied if all clauses evaluate 
to 1. This program, which runs in time linear in the length of w, can be converted to 
a Turing-machine program using the construction of Theorem 3.8.1. This program ex- 
ecutes in a time cubic in the time of the original program on the RAM. We now show 
that every language in NP can be reduced to SATISFIABILITY via a polynomial-time pro- 
gram. 

Given an instance of CIRCUIT SAT, as we now show, we can convert the circuit descrip- 
tion, a straight-line program (see Section 2.2), into an instance of SATISFIABILITY such that 
the former is a "yes" instance of CIRCUIT SAT if and only if the latter is a "yes" instance 
of SATISFIABILITY. Shown below are the different steps of a straight-line program and the 
clauses used to replace them in constructing an instance of SATISFIABILITY. A determinis- 
tic TM can be designed to make these translations in time proportional to the length of the 
circuit description. Clearly the instance of SATISFIABILITY that it produces is a satisfiable 
instance if and only if the instance of CIRCUIT SAT is satisfiable. 

For each gate type it is easy to see that each of the corresponding clauses is satisfiable
only for those gate and argument values that are consistent with the type of gate. For
example, a NOT gate with input $g_j$ has value $g_i = 1$ when $g_j$ has value 0 and $g_i = 0$ when
$g_j$ has value 1. In both cases, both of the clauses $(\bar{g}_i \lor \bar{g}_j)$ and $(g_i \lor g_j)$ are satisfied.
However, if $g_i$ is equal to $g_j$, at least one of the clauses is not satisfied. Similarly, if $g_i$
is the AND of $g_j$ and $g_k$, then examining all eight values for the triple $(g_i, g_j, g_k)$ shows
that only when $g_i$ is the AND of $g_j$ and $g_k$ are all three clauses satisfied. The verification
of the above statements is left as a problem for the reader. (See Problem 3.36.) Since the
output clause $(g_j)$ is true if and only if the circuit output has value 1, it follows that the
set of clauses are all satisfiable if and only if the circuit in question has value 1; that is, it is
satisfiable.

Given an instance of CIRCUIT SAT, clearly a deterministic TM can produce the clauses 
corresponding to each gate using a temporary storage space that is logarithmic in the length 
of the circuit description because it need deal only with integers that are linear in the length 
of the input. Thus, each instance of CIRCUIT SAT can be translated into an instance of 
SATISFIABILITY in a number of steps polynomial in the length of the instance of CIRCUIT 
SAT. Since it is also in NP, it is NP-complete. ■ 
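
The gate-by-gate translation can be sketched in Python. The clauses used below are the standard encoding of each gate as a small set of clauses; they agree with the NOT and AND descriptions given in the proof, but the exact clause table, the step format, and the signed-integer representation of literals are assumptions of this sketch. In particular, input steps are treated here as free gate variables with no clauses of their own.

```python
from itertools import product


def circuit_to_cnf(steps, output):
    """Translate a circuit into clauses, one small group per gate.

    Step i is given the Boolean variable numbered i + 1; a literal is a
    signed integer (+v for the variable, -v for its complement).
    Hypothetical step format: ("VAR",), ("NOT", j), ("AND", j, k), ("OR", j, k).
    """
    clauses = []
    for i, step in enumerate(steps):
        g, op = i + 1, step[0]
        if op == "VAR":
            continue                          # free input: no constraint
        if op == "NOT":
            j = step[1] + 1
            clauses += [[g, j], [-g, -j]]
        elif op == "AND":
            j, k = step[1] + 1, step[2] + 1
            clauses += [[-g, j], [-g, k], [g, -j, -k]]
        elif op == "OR":
            j, k = step[1] + 1, step[2] + 1
            clauses += [[g, -j], [g, -k], [-g, j, k]]
        else:
            raise ValueError(f"unknown step type: {op}")
    clauses.append([output + 1])              # the output clause
    return clauses


def cnf_satisfiable(clauses, num_vars):
    """Brute-force check, used only to test the translation on tiny cases."""
    for bits in product((False, True), repeat=num_vars):
        if all(any(bits[abs(lit) - 1] == (lit > 0) for lit in c) for c in clauses):
            return True
    return False
```

The translation itself makes one pass over the circuit and writes a constant number of clauses per gate, so its output is linear in the length of the circuit description. Only the brute-force checker is exponential, and it is included solely for testing.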

3.9.7 An Efficient Circuit Simulation of TM Computations*

In this section we construct a much more efficient circuit of size $O(Tb \log m)$ that simulates
a computation done in T steps by an m-word, b-bit one-tape TM. A similar result on circuit
depth is shown.



THEOREM 3.9.8 Let an m-word, b-bit Turing machine compute in T steps the function f, a
projection of $f_{\mathrm{TM}}^{(T,m,b)}$, the function computed by the TM in T steps. Then the following bounds on
the size and depth of f over the complete basis $\Omega$ must be satisfied:

$C_\Omega(f) = O(T(\log[\min(bT, S)]))$
$D_\Omega(f) = O(T)$

Proof The circuit $C_{M,T}$ described in Theorem 3.9.1 has size proportional to $O(ST)$, where
$S = mb$. We now show that a circuit computing the same function, $N(1, T, m)$, can be
constructed whose size is $O(T(\log[\min(bT, S)]))$. This new circuit is obtained by more
efficiently simulating the tape unit portion of a Turing machine. We observe that if the head
never reaches a cell, the cell circuit of Fig. 3.26 can be replaced by wires that pass its inputs
to its output. It follows that the number of gates can be reduced if we keep the head near
the center of a simulated tape by "centering" it periodically. This is the basis for the circuit
constructed here.

It simplifies the design of $N(1, T, m)$ to assume that the tape unit has cells indexed
from $-m$ to $m$. Since the head is initially placed over the cell indexed with 0, it is over
the middle cell of the tape unit. (The control unit is designed so that the head never enters
cells whose index is negative.) We construct $N(1, T, m)$ from a subcircuit $N(c, s, n)$ that
simulates s steps of a tape unit containing n b-bit cells under the assumption that the tape
head is initially over one of the middle c cells where c and n are odd. Here $n \ge c + 2s$, so
that in s steps the head cannot move from one of the middle c cells to positions that are not
simulated by this circuit. Let $C(c, s, n)$ and $D(c, s, n)$ be the size and depth of $N(c, s, n)$.

As base cases for our recursive construction of $N(c, s, n)$, consider the circuits $N(1, 1, 3)$
and $N(3, 1, 5)$. They can be constructed from copies of the tape circuit $C_t(3)$ and $C_t(5)$
since they simulate one step of tape units containing three and five cells, respectively. In fact,
these circuits can be simplified by removing unused gates. Without simplification $C_t(n)$
contains $5(b + 1)$ gates in each of the n cell circuits (see Fig. 3.26) as well as $(n - 1)b$ gates
in the vector OR circuit, for a total of at most $6n(b + 1)$ gates. It has depth $4 + \lceil \log_2 n \rceil$.
Thus, $N(1, 1, 3)$ and $N(3, 1, 5)$ each can be realized with $O(b)$ gates and depth $O(1)$.

We now give a recursive construction of a circuit that simulates a tape unit. The
$N(1, 2q, 4q + 1)$ circuit simulates 2q steps of the tape unit when the head is over the middle
cell. It can be decomposed into an $N(1, q, 2q + 1)$ circuit simulating the first q steps and
an $N(2q + 1, q, 4q + 1)$ circuit simulating the second q steps, as shown in Fig. 3.29. In
the $N(1, q, 2q + 1)$ circuit, the head may move from the middle position to any one of
$2q + 1$ positions in q steps, which requires that $2q + 1$ of the inputs be supplied to it. In the
$N(2q + 1, q, 4q + 1)$ circuit, the head starts in the middle $2q + 1$ positions and may move
to any one of $4q + 1$ middle positions in the next q steps, which requires that $4q + 1$ inputs
be supplied to it. The size and depth of our $N(1, 2q, 4q + 1)$ circuit satisfy the following
recurrences:

$C(1, 2q, 4q + 1) \le C(1, q, 2q + 1) + C(2q + 1, q, 4q + 1)$
$D(1, 2q, 4q + 1) \le D(1, q, 2q + 1) + D(2q + 1, q, 4q + 1)$    (3.4)

Figure 3.29 A decomposition of an $N(1, 2q, 4q + 1)$ circuit.



When the number of tape cells is bounded, the above construction and recurrences
can be modified. Let $m = 2^p$ be the maximum number of cells used during a T-step
computation by the TM. We simulate this computation by placing the head over the middle
of a tape with $2m + 1$ cells. It follows that at least m steps are needed to reach each of the
reachable cells. Thus, if $T \le m$, we can simulate the computation with an $N(1, T, 2T + 1)$
circuit. If $T > m$, we can simulate the first m steps with an $N(1, m, 2m + 1)$ circuit and
the remaining $T - m$ steps with $\lceil (T - m)/m \rceil$ copies of an $N(2m + 1, m, 4m + 1)$ circuit.
This follows because at the end of the first m steps the head is over the middle $2m + 1$ of
$4m + 1$ cells (of which only $2m + 1$ are used) and remains in this region after m steps due
to the limitation on the number of cells used by the TM.

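From the above discussion we have the following bounds on the size $C(T, m)$ and depth
$D(T, m)$ of a simulating circuit for a T-step, m-word TM computation:

$$
C(T, m) \le
\begin{cases}
C(1, T, 2T + 1) & T \le m \\
C(1, m, 2m + 1) + \left(\left\lceil \frac{T}{m} \right\rceil - 1\right) C(2m + 1, m, 4m + 1) & T > m
\end{cases}
$$

$$
D(T, m) \le
\begin{cases}
D(1, T, 2T + 1) & T \le m \\
D(1, m, 2m + 1) + \left(\left\lceil \frac{T}{m} \right\rceil - 1\right) D(2m + 1, m, 4m + 1) & T > m
\end{cases}
$$

(3.5)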

We complete the proof of Theorem 3.9.8 by bounding $C(1, 2q, 4q + 1)$,
$C(2q + 1, q, 4q + 1)$, $D(1, 2q, 4q + 1)$, and $D(2q + 1, q, 4q + 1)$ appearing in (3.4) and
combining them with the bounds of (3.5).

We now give a recursive construction of an $N(2q + 1, q, 4q + 1)$ circuit from which
these bounds are derived. Shown in Fig. 3.30 is the recursive decomposition of an
$N(4t + 1, 2t, 8t + 1)$ circuit in terms of two copies of $N(2t + 1, t, 4t + 1)$ circuits. The t-centering
circuits detect whether the head is in positions $2t, 2t - 1, \ldots, 1$, or in positions $-1, \ldots, -2t$.
In the former case, this circuit cyclically shifts the $8t + 1$ inputs down by t positions;
in the latter, it cyclically shifts them up by t positions. The result is that the head is centered
in the middle $2t + 1$ positions. The OR of $s_{-1}, \ldots, s_{-2t}$ can be used as a signal to determine
which shift to take. After centering, t steps are simulated, the head is centered again, and
another t steps are again simulated. Two t-correction circuits cyclically shift the results in
directions that are the reverse of the first two shifts. This circuit correctly simulates the tape
computation over 2t steps and produces an $N(4t + 1, 2t, 8t + 1)$ circuit.

Figure 3.30 A recursive decomposition of $N(4t + 1, 2t, 8t + 1)$.

A t-centering circuit can be realized as a single stage of the cyclic shift circuit described
in Section 2.5.2 and shown in Fig. 2.8. A t-correction circuit is just a t-centering circuit
in which the shift is in the reverse direction. The four shifting circuits can be realized with
$O(tb)$ gates and constant depth. The two OR trees to determine the direction of the shift can
be realized with $O(t)$ gates and depth $O(\log t)$. From this discussion we have the following
bounds on the size and depth of $N(4t + 1, 2t, 8t + 1)$:

$C(4t + 1, 2t, 8t + 1) \le 2C(2t + 1, t, 4t + 1) + O(bt)$
$C(3, 1, 5) \le O(b)$
$D(4t + 1, 2t, 8t + 1) \le 2D(2t + 1, t, 4t + 1) + 2\lceil \log_2 t \rceil$
$D(3, 1, 5) \le O(1)$

We now solve this set of recurrences. Let $\mathcal{C}(k) = C(2t + 1, t, 4t + 1)$ and $\mathcal{D}(k) =
D(2t + 1, t, 4t + 1)$ when $t = 2^k$. The above bounds translate into the following recurrences:

$\mathcal{C}(k + 1) \le 2\mathcal{C}(k) + K_1 2^k + K_2$
$\mathcal{C}(0) \le K_3$
$\mathcal{D}(k + 1) \le 2\mathcal{D}(k) + 2k + K_4$
$\mathcal{D}(0) \le K_5$

for constants $K_1$, $K_2$, $K_3$, $K_4$, and $K_5$. It is straightforward to show that $\mathcal{C}(k)$ and
$\mathcal{D}(k)$ satisfy the following inequalities:

$\mathcal{C}(k) \le 2^k (K_1 k/2 + K_2 + K_3) - K_2$
$\mathcal{D}(k) \le 2^k (K_5 + K_4 + 2) - 2k - (K_4 + 2)$
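
The two closed forms can be checked numerically by iterating the recurrences with equality in place of the inequalities. The Python sketch below does this for arbitrary constants; the function name and the choice of test constants are our own, and exact rational arithmetic is used so that the term $K_1 k/2$ introduces no rounding.

```python
from fractions import Fraction


def check_closed_forms(K1, K2, K3, K4, K5, k_max=20):
    """Compare the recurrences (taken with equality) against the closed forms.

    Returns True if, for every k in 0..k_max, the iterated values equal
        2^k (K1 k/2 + K2 + K3) - K2        for the size recurrence, and
        2^k (K5 + K4 + 2) - 2k - (K4 + 2)  for the depth recurrence.
    """
    size, depth = Fraction(K3), Fraction(K5)       # values at k = 0
    for k in range(k_max + 1):
        closed_size = 2**k * (Fraction(K1 * k, 2) + K2 + K3) - K2
        closed_depth = 2**k * (K5 + K4 + 2) - 2 * k - (K4 + 2)
        if size != closed_size or depth != closed_depth:
            return False
        size = 2 * size + K1 * 2**k + K2           # step to k + 1
        depth = 2 * depth + 2 * k + K4
    return True
```

Since the recurrences are monotone in their arguments, agreement in the equality case shows that the closed forms are upper bounds for any solution of the inequalities.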

We now derive explicit upper bounds to (3.4). Let $\Lambda(k) = C(1, q, 2q + 1)$ and $\Delta(k) =
D(1, q, 2q + 1)$ when $q = 2^k$. Then, the inequalities of (3.4) become the following:

$\Lambda(k + 1) \le \Lambda(k) + \mathcal{C}(k)$
$\Lambda(0) \le K_6$
$\Delta(k + 1) \le \Delta(k) + \mathcal{D}(k)$
$\Delta(0) \le K_7$

where $K_6 = C(1, 1, 3) = 7b + 3$ and $K_7 = D(1, 1, 3) = 4$. The solutions to these
recurrences are given below.

$$
\begin{aligned}
\Lambda(k) &\le K_6 + \sum_{j=0}^{k-1} \mathcal{C}(j) \\
&\le 2^k (K_1 k/2 + K_2 + K_3 - K_1) - k K_2 + (K_6 - (K_2 + K_3 - K_1)) \\
&= O(k 2^k)
\end{aligned}
$$

$$
\begin{aligned}
\Delta(k) &\le K_7 + \sum_{j=0}^{k-1} \mathcal{D}(j) \\
&\le 2^k (K_5 + K_4 + 2) - k^2 + (1 - (K_4 + 2))k + (K_7 - (K_5 + K_4 + 2)) \\
&= O(2^k)
\end{aligned}
$$

Here we have made use of the identity in Problem 3.1. From (3.5) and (3.6) we establish
the result of Theorem 3.9.8. ■



3.10 Design of a Simple CPU 



In this section we design an eleven-instruction CPU for a general-purpose computer that has a
random-access memory with $2^{12}$ 16-bit memory words. We use this design to illustrate how a
general-purpose computer can be assembled from gates and binary storage devices (flip-flops). 
The design is purposely kept simple so that basic concepts are made explicit. In practice, 
however, CPU design can be very complex. Since the CPU is the heart of every computer, a 
high premium is attached to making them fast. Many clever ideas have been developed for this 
purpose, almost all of which we must for simplicity ignore here. 

Before beginning, we note that a typical complex instruction set (CISC) CPU, one with 
a rich set of instructions, contains several tens of thousands of gates, while as shown in the 
previous section, a random-access memory unit has a number of equivalent gates proportional 
to its memory capacity in bits. (CPUs are often sold with caches, small random-access memory 
units that add materially to the number of equivalent gates.) The CPUs of reduced instruction 
set (RISC) computers have many fewer gates. By contrast, a four-megabyte memory has the 
equivalent of several tens of millions of gates. As a consequence, the size and depth of the
next-state and output functions of the random-access memory, $\delta_{\mathrm{RMEM}}$ and $\lambda_{\mathrm{RMEM}}$, typically
dominate the size and depth of the next-state and output functions, $\delta_{\mathrm{CPU}}$ and $\lambda_{\mathrm{CPU}}$, of the
CPU, as shown in Theorem 3.6.1.

3.10.1 The Register Set



A CPU is a sequential circuit that repeatedly reads and executes an instruction from its memory 
in what is known as the fetch-and-execute cycle. (See Sections 3.4 and 3.10.2.) A machine- 
language program is a set of instructions drawn from the instruction set of the CPU. In our 
simple CPU each instruction consists of two parts, an opcode and an address, as shown 
schematically below. 



Bits 1-4    Bits 5-16
Opcode      Address



Since our computer has eleven instructions, we use a 4-bit opcode, a length sufficient to 
represent all of them. Twelve bits remain in the 16-bit word, providing addresses for 4,096 
16-bit words in a random-access memory. 
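
The packing of an opcode and an address into one word can be sketched in Python as follows. This is only an illustration: the function names are ours, and we assume that the opcode occupies the four high-order bits of the word and the address the twelve low-order bits.

```python
OPCODE_BITS = 4
ADDRESS_BITS = 12
ADDRESS_MASK = (1 << ADDRESS_BITS) - 1  # 0x0FFF, twelve one bits


def encode(opcode, address=0):
    """Pack a 4-bit opcode and a 12-bit address into one 16-bit word.

    Assumption: the opcode sits in the high-order bits of the word.
    """
    if not 0 <= opcode < (1 << OPCODE_BITS):
        raise ValueError("opcode must fit in 4 bits")
    if not 0 <= address <= ADDRESS_MASK:
        raise ValueError("address must fit in 12 bits")
    return (opcode << ADDRESS_BITS) | address


def decode(word):
    """Split a 16-bit word into the pair (opcode, address)."""
    return (word >> ADDRESS_BITS) & 0xF, word & ADDRESS_MASK
```

Under these assumptions, encode(0b0101, 10) gives the word 0x500A, and decode recovers the pair (5, 10) from it. The twelve address bits distinguish 1 << 12 = 4,096 memory words.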

We let our CPU have eight special registers: the 16-bit accumulator (AC), the 12-bit 
program counter (PC), the 4-bit opcode register (OPC), the 12-bit memory address register 
(MAR), the 16-bit memory data register (MDR), the 16-bit input register (INR), the 16- 
bit output register (denoted OUTR), and the halt register (HLT). These registers are shown 
schematically together with the random-access memory in Fig. 3.31. 

The program counter PC contains the address from which the next instruction will be 
fetched. Normally this is the address following the address of the current instruction. However, 
if some condition is true, such as that the contents of the accumulator AC are zero, the program 
might place a new address in the PC and jump to this new address. The memory address 
register MAR contains the address used by the random-access memory to fetch a word. The 
memory data register MDR contains the word fetched from the memory. The halt register 
HLT contains the value 0 if the CPU is halted and otherwise contains 1.

Figure 3.31 Basic registers of the simple CPU and the paths connecting them. Also shown
is the arithmetic logic unit (ALU) containing circuits for AND, addition, shifting, and Boolean 
complement. 

3.10.2 The Fetch-and-Execute Cycle

The fetch-and-execute cycle has a fetch portion and an execution portion. The fetch portion 
is always the same: the instruction whose address is in the PC is fetched into the MDR and 
the opcode portion of this register is copied into the OPC. At this point the action of the CPU 
diverges, based on the instruction denoted by the value of the OPC. Suppose, for example, 
that the OPC denotes a load accumulator instruction. The action required is to copy the word 
specified by the address part of the instruction into the accumulator. Fig. 3.32 contains a de- 
composition of the load accumulator instruction into eight microinstructions executed in six 
microcycles. During each microcycle several microinstructions can be executed concurrently, 
as shown in the table for the second and fourth microcycles. In Section 3.10.5 we describe 
implementations of the fetch-and-execute cycle for each of the instructions of our computer. 
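
As a minimal sketch of Fig. 3.32, the following Python function steps through the six microcycles of a load accumulator instruction. The dictionary of registers, the function name, and the placement of the opcode in the four high-order bits of the word are assumptions made for illustration; they are not part of the design itself.

```python
def load_accumulator_cycle(memory, pc):
    """Run the six microcycles of Fig. 3.32 for one load accumulator instruction.

    memory is a list of 16-bit words and pc is the address of the instruction.
    Returns the register contents after the instruction completes.
    """
    regs = {"PC": pc, "MAR": 0, "MDR": 0, "OPC": 0, "AC": 0}
    # Microcycle 1: copy contents of PC to MAR.
    regs["MAR"] = regs["PC"]
    # Microcycle 2: fetch word at address MAR into MDR and increment PC
    # (two microinstructions executed concurrently).
    regs["MDR"], regs["PC"] = memory[regs["MAR"]], (regs["PC"] + 1) % 4096
    # Microcycle 3: copy opcode part of MDR (assumed high-order 4 bits) to OPC.
    regs["OPC"] = regs["MDR"] >> 12
    # Microcycle 4: interpret OPC and copy address part of MDR to MAR.
    regs["MAR"] = regs["MDR"] & 0x0FFF
    # Microcycle 5: fetch word at address MAR into MDR.
    regs["MDR"] = memory[regs["MAR"]]
    # Microcycle 6: copy MDR into AC.
    regs["AC"] = regs["MDR"]
    return regs
```

For example, if location 0 holds the word 0x5007 (opcode 0101 with address 7) and location 7 holds 1234, then the call load_accumulator_cycle(memory, 0) leaves 1234 in AC and 1 in PC.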

It is important to note that a realistic CPU must do more than fetch and execute instructions:
it must be interruptible by a user or an external device that demands its attention. After
fetching and executing an instruction, a CPU typically examines a small set of flip-flops to see 
if it must break away from the program it is currently executing to handle an interrupt, an 
action equivalent to fetching an instruction associated with the interrupt. This action causes 
an interrupt routine to be run that responds to the problem associated with the interrupt, after 
which the CPU returns to the program it was executing when it was interrupted. It can do 
this by saving the address of the next instruction of this program (the value of the PC) at a 
special location in memory (such as address 0). After handling the interrupt, it branches to 
this address by reloading PC with the old value. 

3.10.3 The Instruction Set 

Figure 3.33 lists the eleven instructions of our simple CPU. The first group consists of arith- 
metic (see Section 2.7), logic, and shift instructions (see Section 2.5.1). The circulate in- 
struction executes a cyclic shift of the accumulator by one place. The second group consists 
of instructions to move data between the accumulator and memory. The third set contains 
a conditional jump instruction: when the accumulator is zero, it causes the CPU to resume 
fetching instructions at a new address, the address in the memory data register. This address 
is moved to the program counter before fetching the next instruction. The fourth set contains 
input/output instructions. The fifth set contains the halt instruction. Many more instructions
could be added, including ones to simplify the execution of subroutines, handle loops,
and process interrupts. Each instruction has a mnemonic opcode, such as CLA, and a binary
opcode, such as 0010.

Cycle  Microinstruction                       Microinstruction
1      Copy contents of PC to MAR.
2      Fetch word at address MAR into MDR.    Increment PC.
3      Copy opcode part of MDR to OPC.
4      Interpret OPC                          Copy address part of MDR to MAR.
5      Fetch word at address MAR into MDR.
6      Copy MDR into AC.

Figure 3.32 Decomposition of the load accumulator instruction into eight microinstructions
in six microcycles.

                   Opcode  Binary  Description
Arithmetic, Logic  ADD     0000    Add memory word to AC
                   AND     0001    AND memory word to AC
                   CLA     0010    Clear (set to zero) the accumulator
                   CMA     0011    Complement AC
                   CIL     0100    Circulate AC left
Memory             LDA     0101    Load memory word into AC
                   STA     0110    Store AC into memory word
Jump               JZ      0111    Jump to address if AC zero
I/O                IN      1000    Load INR into AC
                   OUT     1001    Store AC into OUTR
Halt               HLT     1010    Halt computer

Figure 3.33 Instructions of the simple CPU.

Many other operations can be performed using this set, including subtraction, which 
can be realized through the use of ADD, CMA, and twos-complement arithmetic (see Prob- 
lem 3.18). Multiplication is also possible through the use of CIL and ADD (see Problem 3.38). 
Since multiple CILs can be used to rotate right one place, division is also possible. Finally, as 
observed in Problem 3.39, every two-input Boolean function can be realized through the use 
of AND and CMA. This implies that every Boolean function can be realized by this machine 
if it is designed to address enough memory locations. 
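
The constructions mentioned in this paragraph can be checked with a short Python sketch. It models the accumulator operations ADD, AND, CMA, and CIL on 16-bit words; the function names are ours, and the composite operations are one possible realization, not necessarily the one intended in the problems cited.

```python
MASK = 0xFFFF  # the accumulator holds 16 bits


def add(ac, word):
    """ADD: add a memory word to AC, discarding any carry out of bit 16."""
    return (ac + word) & MASK


def and_(ac, word):
    """AND: bitwise AND of a memory word with AC."""
    return ac & word


def cma(ac):
    """CMA: complement every bit of AC."""
    return ~ac & MASK


def cil(ac):
    """CIL: circulate AC left by one place."""
    return ((ac << 1) | (ac >> 15)) & MASK


def subtract(x, y):
    """x - y modulo 2**16 via twos complement: x + (CMA(y) + 1)."""
    return add(x, add(cma(y), 1))


def rotate_right(ac):
    """Fifteen left circulations of a 16-bit word equal one right circulation."""
    for _ in range(15):
        ac = cil(ac)
    return ac


def or_(a, b):
    """OR from AND and CMA by De Morgan's rule."""
    return cma(and_(cma(a), cma(b)))
```

Here subtract(7, 5) yields 2, while subtract(5, 7) yields 0xFFFE, the 16-bit twos-complement representation of -2.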

Each of these instructions is a direct memory instruction, by which we mean that all 
addresses refer directly to memory locations containing the operands (data) on which the pro- 
gram operates. Most CPUs also have indirect memory instructions (and are said to support 
indirection). These are instructions in which an address is interpreted as the address at which 
to find the address containing the needed operand. To find such an indirect operand, the CPU 
does two memory fetches, the first to find the address of the operand and the second to find 
the operand itself. Often a single bit is added to an opcode to denote that an instruction is an 
indirect memory instruction. 
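
The difference between the two addressing modes amounts to one extra memory fetch, as the following Python sketch suggests. The function names are hypothetical, and we assume that the address found in memory is taken from the twelve low-order bits of the word.

```python
def fetch_direct(memory, address):
    """Direct addressing: the address names the operand itself."""
    return memory[address]


def fetch_indirect(memory, address):
    """Indirect addressing: the address names a word holding the operand's address."""
    operand_address = memory[address] & 0x0FFF  # first fetch: address of operand
    return memory[operand_address]              # second fetch: the operand
```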

An instruction stored in the memory of our computer consists of sixteen binary digits, the 
first four denoting the opcode and the last twelve denoting an address. Because it is hard for 
humans to interpret such machine-language statements, mnemonic opcodes and assembly 
languages have been devised. 

3.10.4 Assembly-Language Programming

An assembly-language program consists of a number of lines each containing either a real or 
pseudo-instruction. Real instructions correspond exactly to machine-language instructions except
that they contain mnemonics and symbolic addresses instead of binary sequences. Pseudo-instructions
are directions to the assembler, the program that translates an assembly-language
program into machine language. A typical pseudo-instruction is ORG 100, which instructs 
the assembler to place the following lines at locations beginning with location 100. Another 
example is the DAT pseudo-instruction that identifies a word containing only data. The END 
pseudo-instruction identifies the end of the assembly-language program. 

Each assembly-language instruction fits on one line. A typical instruction has the following 
fields, some or all of which may be used. 



Symbolic_Address    Mnemonic    Address    Indirect Bit    Comment



If an instruction has a Symbolic_Address (a string of symbols), the address is converted 
to the physical address of the instruction by the assembler and substituted for all uses of the 
symbolic address. The Address field can contain one or more symbolic or real addresses, al- 
though the assembly language used here allows only one address. The Indirect Bit specifies 
whether or not indirection is to be used on the address in question. In our CPU we do not 
allow indirection, although we do allow it in our assembly language because it simplifies our 
sample program. 

Let's now construct an assembly-language program whose purpose is to boot up a computer 
that has been reset. The boot program reads another program provided through its input port 
and stores this new program (a sequence of 16-bit words) in the memory locations just above 
itself. When it has finished reading this new program (determined by reading a zero word), 
it transfers control to the new program by jumping to the first location above itself. When 
computers are turned off at night they need to be rebooted, typically by executing a program 
of this kind. 

Figure 3.34 shows a program to boot up our computer. It uses three symbolic addresses, 
ADDR_1, ADDR_2, ADDR_3, and one real address, 10. We assume this program resides 





          ORG   0               Program is stored at location 0.
ADDR_1    IN                    Start of program.
          JZ    10              Transfer control if AC zero.
          STA   ADDR_2    I     Indirect store of input.
          LDA   ADDR_2          Start incrementing ADDR_2.
          ADD   ADDR_3          Finish incrementing of ADDR_2.
          STA   ADDR_2          Store new value of ADDR_2.
          CLA                   Clear AC.
          JZ    ADDR_1          Jump to start of program.
ADDR_2    DAT   10              Address for indirection.
ADDR_3    DAT   1               Value for incrementing.
          END

Figure 3.34 A program to reboot a computer.
permanently in locations 0 through 9 of the memory. After being reset, the CPU reads and
executes the instruction at location 0 of its memory.

The first instruction of this program after the ORG statement reads the value in the input 
register into the accumulator. The second instruction jumps to location 10 if the accumulator 
is zero, indicating that the last word of the second program has been written into the memory. 
If this happens, the next instruction executed by the CPU is at location 10; that is, control is 
transferred to the second program. If the accumulator is not zero, its value is stored indirectly at 
location ADDR_2. (We explain the indirect STA in the next paragraph.) On the first execution 
of this command, the value of ADDR_2 is 10, so that the contents of the accumulator are 
stored at location 10. The next three steps increment the value of ADDR_2 by placing its 
contents in the accumulator, adding the value in location ADDR_3 to it, namely 1, and storing 
the new value into location ADDR_2. Finally, the accumulator is zeroed and a JZ instruction
is used to return to location ADDR_1, the first address of the boot program.

The indirect STA instruction in this program is not available in our computer. However, 
as shown in Problem 3.42, this instruction can be simulated by a self-modifying subprogram. 
While it is considered bad programming practice to write self-modifying programs, this exer- 
cise illustrates the power of self-modification as well as the advantage of having indirection in 
the instruction set of a computer. 
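
The behavior of the boot program can be traced with a small Python interpreter. This is a sketch under stated assumptions: the instructions of Fig. 3.34 are held as symbolic tuples rather than as binary words, the indirect STA is interpreted directly (as the assembly language allows, although the CPU itself does not), and each IN is assumed to deliver the next word of a hypothetical input sequence.

```python
def run_boot(input_words, max_steps=10_000):
    """Interpret the boot program of Fig. 3.34.

    Returns (memory, pc) at the moment control leaves the boot program.
    """
    ADDR_1, ADDR_2, ADDR_3 = 0, 8, 9  # locations assigned by ORG 0
    program = {
        0: ("IN", None, False),
        1: ("JZ", 10, False),
        2: ("STA", ADDR_2, True),    # indirect store of input
        3: ("LDA", ADDR_2, False),
        4: ("ADD", ADDR_3, False),
        5: ("STA", ADDR_2, False),
        6: ("CLA", None, False),
        7: ("JZ", ADDR_1, False),
    }
    memory = [0] * 4096
    memory[ADDR_2] = 10  # DAT 10: address for indirection
    memory[ADDR_3] = 1   # DAT 1: value for incrementing
    inputs = iter(input_words)  # assumption: IN reads the next input word
    ac, pc = 0, 0
    for _ in range(max_steps):
        if pc not in program:  # control has been transferred to the new program
            return memory, pc
        op, addr, indirect = program[pc]
        pc += 1
        if op == "IN":
            ac = next(inputs) & 0xFFFF
        elif op == "JZ":
            if ac == 0:
                pc = addr
        elif op == "STA":
            target = memory[addr] & 0x0FFF if indirect else addr
            memory[target] = ac
        elif op == "LDA":
            ac = memory[addr]
        elif op == "ADD":
            ac = (ac + memory[addr]) & 0xFFFF
        elif op == "CLA":
            ac = 0
    raise RuntimeError("boot program did not finish within max_steps")
```

Given the input words 0x5064, 0x0065, and 0, the interpreter stores the first two words at locations 10 and 11, leaves 12 in ADDR_2, and transfers control to location 10 when the zero word is read.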

3.10.5 Timing and Control 

Now that the principles of a CPU have been described and a programming example given, we 
complete the description of a sequential circuit realizing the CPU. To do this we need to de- 
scribe circuits controlling the combining and movement of data. To this end we introduce the 
assignment notation in Fig. 3.35. Here the expression AC <- MDR means that the contents
of MDR are copied into AC, whereas AC <- AC + MDR means that the contents of AC and
MDR are added and the result assigned to AC. In all cases the left arrow, <-, signifies that
the result or contents on the right are assigned to the register on the left. However, when the
register on the left contains information of a particular type, such as an address in the case of
PC or an opcode in the case of OPC, and the register on the right contains more information,
the assignment notation means that the relevant bits of the register on the right are loaded
into the register on the left. For example, the assignment PC <- MDR means that the address
portion of MDR is copied to PC.

Notation           Explanation
AC <- MDR          Contents of MDR loaded into AC.
AC <- AC + MDR     Contents of MDR added to AC.
MDR <- M           Contents of memory location MAR loaded into MDR.
M <- MDR           Contents of MDR stored at memory location MAR.
PC <- MDR          Address portion of MDR loaded into PC.
MAR <- PC          Contents of PC loaded into MAR.

Figure 3.35 Microinstructions illustrating assignment notation.

t_1    MAR <- PC
t_2    MDR <- M, PC <- PC+1
t_3    OPC <- MDR

Figure 3.36 The microcode for the fetch portion of each instruction.

Register transfer notation uses these assignment operations as well as timing information
to break down a machine-level instruction into microinstructions that are executed in successive
microcycles. The jth microcycle is specified by the timing variable $t_j$, $1 \le j \le k$. That
is, $t_j$ is 1 during the jth microcycle and is zero otherwise. It is straightforward to show that
these timing variables can be realized by connecting a decoder to the outputs of a counting
circuit, a circuit containing the binary representation of an integer that increments the integer
modulo some other integer on each clock cycle. (See Problem 3.40.)
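
The counter-plus-decoder arrangement can be modeled in a few lines of Python. This is a behavioral sketch only, not a gate-level design; the names timing_variables and first_cycles are ours.

```python
from itertools import islice


def timing_variables(k):
    """Yield the tuple (t_1, ..., t_k) on successive clock cycles.

    A counter holds an integer modulo k, and a decoder turns that integer
    into k outputs of which exactly one is 1.
    """
    count = 0  # state of the counting circuit
    while True:
        # Decoder: output j is 1 exactly when the counter holds j.
        yield tuple(1 if count == j else 0 for j in range(k))
        count = (count + 1) % k  # increment modulo k on each clock cycle


def first_cycles(k, n):
    """Return the timing variables for the first n clock cycles."""
    return list(islice(timing_variables(k), n))
```

With k = 3, the first four cycles produce (1, 0, 0), (0, 1, 0), (0, 0, 1), and again (1, 0, 0).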

Since the fetch portion of each instruction is the same, we write a few lines of register 
transfer notation for it, as shown in Fig. 3.36. On the left-hand side of each line is a timing
variable indicating the cycle during which the microinstruction is executed. 

The microinstructions for the execute portion of each instruction of our computer are
shown in Fig. 3.37. On the left-hand side of each line is a timing variable that must be ANDed
with the indicated instruction variable, such as $c_{\mathrm{ADD}}$, which is 1 if that instruction is in
the opcode register OPC and 0 otherwise. These instruction variables can be generated by a
decoder attached to the output of OPC. Here ¬AC denotes the complement of the accumulator.

Instruction  Control      Microinstruction
ADD          c_ADD t_4    MAR <- MDR
             c_ADD t_5    MDR <- M
             c_ADD t_6    AC <- AC + MDR
AND          c_AND t_4    MAR <- MDR
             c_AND t_5    MDR <- M
             c_AND t_6    AC <- AC AND MDR
CLA          c_CLA t_4    AC <- 0
CIL          c_CIL t_4    AC <- Shift(AC)
LDA          c_LDA t_4    MAR <- MDR
             c_LDA t_5    MDR <- M
             c_LDA t_6    AC <- MDR
STA          c_STA t_4    MAR <- MDR
             c_STA t_4    MDR <- AC
             c_STA t_5    M <- MDR
CMA          c_CMA t_4    AC <- ¬AC
JZ           c_JZ t_4     if (AC = 0) PC <- MDR
IN           c_IN t_4     AC <- INR
OUT          c_OUT t_4    OUTR <- AC
HLT          c_HLT t_4    t_j <- 0 for 1 ≤ j ≤ k

Figure 3.37 The execute portions of the microcode of instructions.

Now that we understand how to combine microinstructions in microcycles to produce 
macroinstructions, we use this information to define control variables that control the move- 
ment of data between registers or combine the contents of two registers and assign the result 
to another register. This information will be used to complete the design of the CPU. 

We now introduce notation for control variables. If a microinstruction results in the
movement of data from register B to register A, denoted A <- B in our assignment notation,
we associate the control variable L(A, B) with it. If a microinstruction results in the
combination of the contents of registers B and C with the operation ⊙ and the assignment
of the result to register A, denoted A <- B ⊙ C in our assignment notation, we associate
the control variable L(A, B ⊙ C) with it. For example, inspection of Figs. 3.36 and 3.37
shows that we can write the following expressions for the control variables L(OPC, MDR)
and L(AC, AC+MDR):

$L(\mathrm{OPC}, \mathrm{MDR}) = t_3$
$L(\mathrm{AC}, \mathrm{AC}+\mathrm{MDR}) = c_{\mathrm{ADD}} \wedge t_6$

Thus, OPC is loaded with the contents of MDR when $t_3 = 1$, and the contents of AC are
added to those of MDR and copied into AC when $c_{\mathrm{ADD}} \wedge t_6 = 1$.

The complete set of control variables can be obtained by first grouping together all the mi- 
croinstructions that affect a given register, as shown in Fig. 3.38, and then writing expressions 
for the control variables. Here M denotes the memory unit and HLT is a special register that 
must be set to 1 for the CPU to run. Inspection of Fig. 3.38 leads to the following expressions 
for control variables: 



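$L(\mathrm{AC}, \mathrm{AC} + \mathrm{MDR}) = c_{\mathrm{ADD}} \wedge t_6$
$L(\mathrm{AC}, \mathrm{AC}\ \mathrm{AND}\ \mathrm{MDR}) = c_{\mathrm{AND}} \wedge t_6$
$L(\mathrm{AC}, 0) = c_{\mathrm{CLA}} \wedge t_4$
$L(\mathrm{AC}, \mathrm{Shift}(\mathrm{AC})) = c_{\mathrm{CIL}} \wedge t_4$
$L(\mathrm{AC}, \mathrm{MDR}) = c_{\mathrm{LDA}} \wedge t_6$
$L(\mathrm{AC}, \mathrm{INR}) = c_{\mathrm{IN}} \wedge t_4$
$L(\mathrm{AC}, \neg\mathrm{AC}) = c_{\mathrm{CMA}} \wedge t_4$
$L(\mathrm{MAR}, \mathrm{PC}) = t_1$
$L(\mathrm{MAR}, \mathrm{MDR}) = (c_{\mathrm{ADD}} \vee c_{\mathrm{AND}} \vee c_{\mathrm{LDA}} \vee c_{\mathrm{STA}}) \wedge t_4$
$L(\mathrm{MDR}, \mathrm{M}) = t_2 \vee (c_{\mathrm{ADD}} \vee c_{\mathrm{AND}} \vee c_{\mathrm{LDA}}) \wedge t_5$
$L(\mathrm{MDR}, \mathrm{AC}) = c_{\mathrm{STA}} \wedge t_4$
$L(\mathrm{M}, \mathrm{MDR}) = c_{\mathrm{STA}} \wedge t_5$
$L(\mathrm{PC}, \mathrm{PC}+1) = t_2$
$L(\mathrm{PC}, \mathrm{MDR}) = (\mathrm{AC} = 0) \wedge c_{\mathrm{JZ}} \wedge t_4$
$L(\mathrm{OPC}, \mathrm{MDR}) = t_3$
$L(\mathrm{OUTR}, \mathrm{AC}) = c_{\mathrm{OUT}} \wedge t_4$
$L(t_j) = c_{\mathrm{HLT}} \wedge t_4$ for $1 \le j \le 6$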

OPC     t_3          OPC <- MDR
OUTR    c_OUT t_4    OUTR <- AC
HLT     c_HLT t_4    t_j <- 0 for 1 ≤ j ≤ k

Figure 3.38 The microinstructions affecting each register.




The expression (AC = 0) denotes a Boolean variable whose value is 1 if all bits in the AC
are zero and 0 otherwise. This variable is the AND of the complement of each component of
register AC. 
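
The control-variable expressions can be evaluated mechanically, as the following Python sketch illustrates. The function name, the dictionary keys, and the representation of the active microcycle by its index are our own conventions; the opcode values are those of Fig. 3.33, and ∧ is given precedence over ∨ in the expression for L(MDR, M).

```python
OPCODES = {"ADD": 0, "AND": 1, "CLA": 2, "CMA": 3, "CIL": 4, "LDA": 5,
           "STA": 6, "JZ": 7, "IN": 8, "OUT": 9, "HLT": 10}


def control_variables(opc, cycle, ac):
    """Evaluate the control-variable expressions for one microcycle.

    opc is the 4-bit opcode held in OPC, cycle is the index j of the
    timing variable t_j that is currently 1 (1..6), and ac is the
    contents of the accumulator.
    """
    # Decoder on OPC: instruction variable c[name] is 1 for the held opcode.
    c = {name: int(opc == code) for name, code in OPCODES.items()}
    # Timing variables: t[j] is 1 only during microcycle j.
    t = {j: int(cycle == j) for j in range(1, 7)}
    ac_is_zero = int(ac == 0)  # AND of the complemented bits of AC
    return {
        "L(AC,AC+MDR)": c["ADD"] & t[6],
        "L(AC,AC AND MDR)": c["AND"] & t[6],
        "L(AC,0)": c["CLA"] & t[4],
        "L(AC,Shift(AC))": c["CIL"] & t[4],
        "L(AC,MDR)": c["LDA"] & t[6],
        "L(AC,INR)": c["IN"] & t[4],
        "L(AC,NOT AC)": c["CMA"] & t[4],
        "L(MAR,PC)": t[1],
        "L(MAR,MDR)": (c["ADD"] | c["AND"] | c["LDA"] | c["STA"]) & t[4],
        "L(MDR,M)": t[2] | ((c["ADD"] | c["AND"] | c["LDA"]) & t[5]),
        "L(MDR,AC)": c["STA"] & t[4],
        "L(M,MDR)": c["STA"] & t[5],
        "L(PC,PC+1)": t[2],
        "L(PC,MDR)": ac_is_zero & c["JZ"] & t[4],
        "L(OPC,MDR)": t[3],
        "L(OUTR,AC)": c["OUT"] & t[4],
        "L(t_j)": c["HLT"] & t[4],
    }
```

For instance, with STA in OPC during the fourth microcycle, exactly two control variables are 1, namely L(MAR, MDR) and L(MDR, AC), reflecting the two microinstructions executed concurrently in that cycle.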

To illustrate the remaining steps in the design of the CPU, we show in Fig. 3.39 the
circuits used to provide input to the accumulator AC. Shown are registers AC, MDR, and
INR as well as circuits for the functions $f_{\mathrm{add}}$ (see Section 2.7) and $f_{\mathrm{and}}$ that add two binary
numbers and take their AND, respectively. Also shown are multiplexer circuits $f_{\mathrm{mux}}$ (see
Section 2.5.5). They have three control inputs, $L_0$, $L_1$, and $L_2$, and can select one of eight
inputs to place on their output lines. However, only seven inputs are needed: the result of
adding AC and MDR, the result of ANDing AC and MDR, the zero vector, the result of shifting
AC, the contents of MDR, the contents of INR, and the complement of AC. The three control inputs
encode the seven control variables, L(AC, AC + MDR), L(AC, AC AND MDR), L(AC, 0),
L(AC, Shift(AC)), L(AC, MDR), L(AC, INR), and L(AC, ¬AC). Since at most one of these
control variables has value 1 at any one time, the encoder circuit of Section 2.5.3 can be used
to encode these seven control variables into the three bits $L_0$, $L_1$, and $L_2$ shown in Fig. 3.39.

The logic circuit to supply inputs to AC has size proportional to the number of bits in each 
register. Thus, if the word size of the CPU were scaled up, the size of this circuit would scale 
linearly with the word size. 
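
A behavioral model of this selection logic is sketched below in Python. The function names are ours. The assignment of the seven control variables to multiplexer inputs 1 through 7, and the use of the unused eighth input (index 0) to hold the current value of AC when no control variable is 1, are assumptions made for the sketch rather than features stated in the design.

```python
MASK = 0xFFFF  # 16-bit registers


def encode_select(controls):
    """Encoder: map seven mutually exclusive control bits to a select value.

    Returns 0 when no control variable is 1 (assumed to mean "hold AC").
    """
    active = [i for i, bit in enumerate(controls) if bit]
    if len(active) > 1:
        raise ValueError("at most one control variable may be 1")
    return active[0] + 1 if active else 0


def next_ac(ac, mdr, inr, controls):
    """Multiplexer: choose the next contents of AC.

    controls lists the seven control variables in the order
    L(AC,AC+MDR), L(AC,AC AND MDR), L(AC,0), L(AC,Shift(AC)),
    L(AC,MDR), L(AC,INR), L(AC,NOT AC).
    """
    inputs = [
        ac,                               # 0: assumed hold input
        (ac + mdr) & MASK,                # 1: result of adding AC and MDR
        ac & mdr,                         # 2: result of ANDing AC and MDR
        0,                                # 3: the zero vector
        ((ac << 1) | (ac >> 15)) & MASK,  # 4: AC circulated left one place
        mdr,                              # 5: contents of MDR
        inr,                              # 6: contents of INR
        ~ac & MASK,                       # 7: complement of AC
    ]
    return inputs[encode_select(controls)]
```

Every entry of inputs is computed from b-bit registers by a circuit whose size grows linearly with b, which matches the scaling observation above.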

Figure 3.39 Circuits providing input to the accumulator AC.



The circuit for the program counter PC can be designed from an adder, a multiplexer, and 
a few additional gates. Its size is proportional to $\lceil \log_2 m \rceil$. The circuits to supply inputs to
the remaining registers, namely MAR, MDR, OPC, INR, and OUTR, are less complex to 
design than those for the accumulator. The same observations apply to the control variable to 
write the contents of the memory. The complete design of the CPU is given as an exercise (see 
Problem 3.41). 



3.10.6 CPU Circuit Size and Depth 

Using the design given above for a simple CPU as a basis, we derive upper bounds on the size 
and depth of the next-state and output functions of the RAM CPU defined in Section 3.4. 

All words on which the CPU operates contain b bits except for addresses, which contain
$\lceil \log m \rceil$ bits where m is the number of words in the random-access memory. We assume that
the CPU not only has a $\lceil \log m \rceil$-bit program counter but can send the contents of the PC
to the MAR of the random-access memory in one unit of time. When the CPU fetches an
instruction that refers to an address, it may have to retrieve multiple b-bit words to create a
$\lceil \log m \rceil$-bit address. We assume the time for such operations is counted in the number T of
steps that the RAM takes for the computation.

The arithmetic operations supported by the RAM CPU include addition and subtraction, 
operations realized by circuits with size and depth linear and logarithmic respectively in b, the 
length of the accumulator. (See Section 2.7.) The same is true for the logical vector and the
shift operations. (See Section 2.5.1.) Thus, circuits affecting the accumulator (see Fig. 3.39)
have size $O(b)$ and depth $O(\log b)$. Circuits affecting the opcode and output registers and
the memory address and data registers are simple and have size $O(b)$ and depth $O(\log b)$.
The circuits affecting the program counter not only support transfer of data from the accumulator
to the program counter but also allow the program counter to be incremented. The
latter function can be performed by an adder circuit whose size is $O(\lceil \log m \rceil)$ and depth is
$O(\log \lceil \log m \rceil)$. It follows that

$C_\Omega(\delta_{\mathrm{CPU}}) = O(b + \lceil \log m \rceil)$
$D_\Omega(\delta_{\mathrm{CPU}}) = O(\log b + \log \lceil \log m \rceil)$

3.10.7 Emulation 

In Section 3.4 we demonstrated that whatever computation can be done by a finite-state ma- 
chine can be done by a RAM when the latter has sufficient memory. This universal nature of 
the RAM, which is a model for the CPU we have just designed, is emphasized by the problem 
of emulation, the simulation of one general-purpose computer by another. 

Emulation of a target CPU by a host CPU means reading the instructions in a program 
for the target CPU and executing host instructions that have the same effect as the target 
instructions. In Problem 3.44 we ask the reader to sketch a program to emulate one CPU 
by another. This is another manifestation of universality, this time for unbounded-memory 
RAMs. 
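
To make the idea concrete, the sketch below emulates the simple CPU of this section on a host, here Python. It is an illustration under assumptions, not a solution to Problem 3.44: the opcode is taken from the four high-order bits of each word, IN is assumed to read the next word of a hypothetical input sequence, and OUT is assumed to append AC to a list of outputs.

```python
MASK16, MASK12 = 0xFFFF, 0x0FFF
# Binary opcodes 0000 through 1010, in the order of Fig. 3.33.
ADD, AND, CLA, CMA, CIL, LDA, STA, JZ, IN, OUT, HLT = range(11)


def emulate(memory, inputs=(), max_steps=100_000):
    """Emulate the simple CPU; return (memory, outputs) when it halts."""
    mem = list(memory) + [0] * (4096 - len(memory))  # 4,096 16-bit words
    feed = iter(inputs)
    ac, pc, outputs = 0, 0, []
    for _ in range(max_steps):
        word = mem[pc]                           # fetch
        pc = (pc + 1) & MASK12
        opcode, addr = word >> 12, word & MASK12
        if opcode == ADD:                        # execute
            ac = (ac + mem[addr]) & MASK16
        elif opcode == AND:
            ac &= mem[addr]
        elif opcode == CLA:
            ac = 0
        elif opcode == CMA:
            ac = ~ac & MASK16
        elif opcode == CIL:
            ac = ((ac << 1) | (ac >> 15)) & MASK16
        elif opcode == LDA:
            ac = mem[addr]
        elif opcode == STA:
            mem[addr] = ac
        elif opcode == JZ:
            if ac == 0:
                pc = addr
        elif opcode == IN:
            ac = next(feed) & MASK16
        elif opcode == OUT:
            outputs.append(ac)
        elif opcode == HLT:
            return mem, outputs
        else:
            raise ValueError("undefined opcode")
    raise RuntimeError("target program did not halt within max_steps")
```

As a usage example, the target program [0x5007, 0x3000, 0x0008, 0x0006, 0x9000, 0xA000, 9, 4, 1] loads the word at location 7, complements it, adds the constant 1 at location 8 and the word at location 6, outputs the result, and halts. It computes 9 - 4 by twos-complement arithmetic and produces the output list [5].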



Problems 

MATHEMATICAL PRELIMINARIES 

3.1 Establish the following identity: 

$\sum_{j=0}^{k} j 2^j = 2((k-1)2^k + 1)$

3.2 Let $p : \mathbb{N} \mapsto \mathbb{N}$ and $q : \mathbb{N} \mapsto \mathbb{N}$ be polynomial functions on the set $\mathbb{N}$ of non-negative
integers. Show that p(q(n)) is also a polynomial in n.

FINITE-STATE MACHINES 

3.3 Describe an FSM that compares two binary numbers supplied as concurrent streams of 
bits in descending order of importance and enters a rejecting state if the first string is 
smaller than the second and an accepting state otherwise. 

3.4 Describe an FSM that computes the threshold-two function on n Boolean inputs that 
are supplied sequentially to the machine. 

3.5 Consider the full-adder function $f_{\mathrm{FA}}(x_i, y_i, c_i) = (c_{i+1}, s_i)$ defined below where +
denotes integer addition:

$2c_{i+1} + s_i = x_i + y_i + c_i$

Show that the subfunction of $f_{\mathrm{FA}}$ obtained by fixing $c_i = 0$ and deleting $c_{i+1}$ is the
EXCLUSIVE OR of the variables $x_i$ and $y_i$.

3.6 It is straightforward to show that every Moore FSM is a Mealy FSM. Given a Mealy 
FSM, show how to construct a Moore FSM whose outputs for every input sequence are 
identical to those of the Mealy FSM. 

3.7 Find a deterministic FSM that recognizes the same language as that recognized by the 
nondeterministic FSM of Fig. 3.8. 

3.8 Write a program in a language of your choice that writes the straight-line program 
described in Fig. 3.3 for the FSM of Fig. 3.2 realizing the EXCLUSIVE OR function. 

SHALLOW FSM CIRCUITS 

3.9 Develop a representation for states in the m-word, b-bit random-access memory so that
its next-state mappings form a semigroup. 

Hint: Show that the information necessary to update the current state can be succinctly 
described. 

3.10 Show that matrix multiplication is associative. 

SEQUENTIAL CIRCUITS 

3.11 Show that the circuit of Fig. 3.15 computes the functions defined in the tables of 
Fig. 3.14. 

Hint: Section 2.2 provides a method to produce a circuit from a tabular description of 
a binary function. 

3.12 Design a sequential circuit (an electronic lock) that enters an accepting state only when 
it receives some particular four-bit sequence that you specify. 

3.13 Design a sequential circuit (a modulo-p counter) that increments a binary number by 
one on each step until it reaches the integer value p, at which point it resets its value to 
zero. You should assume that p is not a power of 2. 

3.14 Give an efficient design of an incrementing/decrementing counter, a sequential circuit
that increments or decrements a binary number modulo $2^n$. Specify the machine
as an FSM and determine the number of gates in the sequential circuit in terms of n. 

RANDOM-ACCESS MACHINES 

3.15 Given a straight-line program for a Boolean function, describe the steps taken to com- 
pute it during fetch-and-execute cycles of a RAM. Determine whether jump instruc- 
tions are necessary to execute such programs. 

3.16 Consulting Theorem 3.4.1, determine whether jump instructions are necessary for all 
RAM computations. If not, what advantage accrues to using them? 

3.17 Sketch a RAM program using time and space $O(n)$ that recognizes strings of the form
$\{0^m 1^m \mid 1 \le m \le n\}$.

ASSEMBLY-LANGUAGE PROGRAMMING

3.18 Write an assembly-language program in the language of Fig. 3.18 to subtract two inte- 
gers. 

3.19 The assembly-language instructions of Fig. 3.18 operate on integers. Show that the 
operations AND, OR, and NOT can be realized on Boolean variables with these instruc- 
tions. Show also that these operations on vectors can be implemented. 

3.20 Write an assembly-language program in the language of Fig. 3.18 to form x y for inte- 
gers x and y. 

3.21 Show that the assembly-language instructions CLR R_i, R_i <- R_j, JMP+ N_i, and JMP- N_i
can be realized from the assembly-language instructions INC, DEC, CONTINUE,
R_j JMP+ N_i, and R_j JMP- N_i.

TURING MACHINES 

3.22 In a standard Turing machine the tape unit has a left end but extends indefinitely to the 
right. Show that allowing the tape unit to be infinite in both directions does not add 
power to the Turing machine. 

3.23 Describe in detail a Turing machine with unlimited storage capacity that recognizes the 
language $\{0^m 1^m \mid 1 \le m\}$.

3.24 Sketch a proof that in 0{n ) steps a Turing machine can verify that a particular tour 
of n cities in an instance of the Traveling Salesperson Problem satisfies the requirement 
that the total distance traveled is less than or equal to the limit k set on this instance of 
the Traveling Salesperson Problem. 

3.25 Design the additional circuitry needed to transform a sequential circuit for a random- 
access memory into one for a tape memory. Give upper bounds on the size and depth 
of the next-state and output functions that are simultaneously achievable. 

3.26 In the proof of Theorem 3.8.1 it is assumed that the words and their addresses in a 
RAM memory unit are placed on the tape of a Turing machine in order of increasing 
addresses, as suggested by Fig. 3.40. The addresses, which are $\lceil \log m \rceil$ bits in length,
are organized as a collection of $\lceil \lceil \log m \rceil / b \rceil$ $b$-bit words. (In the example, $b = 1$.) An
address is written on tape cells that immediately precede the value of the corresponding 
RAM word. A RAM address addr is stored on the tape to the left in the shaded region. 
Assume that markers can be placed on cells. (This amounts to enlarging the tape al- 
phabet by a constant factor.) Show that markers can be used to move from the first 
word whose RAM address matches the $ib$ most significant bits of the address a to the
next one that matches the $(i+1)b$ most significant bits. Show that this procedure can
be used to find the RAM word whose address matches addr in $O((m/b)(\log m)^2)$
Turing machine steps by a machine that can store in its control unit only one $b$-bit
subword of addr.

Figure 3.40 A TM tape with markers on words and the first bit of each address.

3.27 Extend Problem 3.26 by demonstrating that the simulation can be done with a binary 
tape symbol alphabet. 

3.28 Extend Theorem 3.8.1 to show that there exists a Turing machine that can simulate an 
unbounded-memory RAM. 

3.29 Sketch a proof that every Turing machine can be simulated by a RAM program of the 
kind described in Section 3.4.3. 

Hint: Because such RAM programs can only have a finite number of registers, encode 
the contents of the TM tape as a number to be stored in one register. 

COMPUTATIONAL INEQUALITIES FOR TURING MACHINES 

3.30 Show that a one-tape Turing machine needs time exponential in n to compute most 
Boolean functions $f : \mathcal{B}^n \mapsto \mathcal{B}$ on n variables, regardless of how much memory is
allocated to the computation. 

3.31 Apply Theorem 3.2.2 to the one-tape Turing machine that executes T steps. Deter- 
mine whether the resulting inequalities are weaker or stronger than those given in The- 
orem 3.9.2. 

3.32 Write a program in your favorite language for the procedure WRITE_OR(t, m) introduced
in Fig. 3.27.

3.33 Write a program in your favorite language for the procedure WRITE_CELL_CIRCUIT(t, m)
introduced in Fig. 3.27.

Hint: See Problem 2.4. 

FIRST P-COMPLETE AND NP-COMPLETE PROBLEMS 

3.34 Show that the language MONOTONE CIRCUIT VALUE defined below is P-complete. 

MONOTONE CIRCUIT VALUE 

Instance: A description for a monotone circuit with fixed values for its input variables 

and a designated output gate. 

Answer: "Yes" if the output of the circuit has value 1.

Hint: Using dual-rail logic, find a way to translate (reduce) a string in the language 
CIRCUIT VALUE to a string in MONOTONE CIRCUIT VALUE by converting in loga- 
rithmic space (in the length of the string) a circuit over the standard basis to a circuit 
over the monotone basis. Note that, as stated in the text, the composition of two 
logarithmic-space reductions is a logarithmic-space reduction. To simplify the con- 
version from non-monotone circuits to monotone circuits, use even integers to index 
vertices in the non-monotone circuits so that both even and odd integers can be used 
in the monotone case. 

3.35 Show that the language FAN-OUT 2 CIRCUIT SAT defined below is NP-complete. 

FAN-OUT 2 CIRCUIT SAT

Instance: A description for a circuit of fan-out 2 with free values for its input variables 

and a designated output gate. 

Answer: "Yes" if the output of the circuit has value 1.

Hint: To reduce the fan-out of a vertex, replace the direct connections between a gate 
and its successors by a binary tree whose vertices are AND gates with their inputs con- 
nected together. Show that, for each gate of fan-out more than two, such trees can be 
generated by a program that runs in polynomial time. 

3.36 Show that clauses given in the proof of Theorem 3.9.7 are satisfied only when their 
variables have values consistent with the definition of the gate type. 

3.37 A circuit with n input variables $\{x_1, x_2, \ldots, x_n\}$ is satisfiable if there is an assignment
of values to the variables such that the output of the circuit has value 1. Assume that
the circuit has only one output and the gates are over the basis $\Omega$ = {AND, OR, NOT}.

a) Describe a nondeterministic procedure that accepts as input the description of a 
circuit in POSE and returns 1 if the circuit is satisfiable and otherwise. 

b) Describe a deterministic procedure that accepts as input the description of a circuit 
in POSE and returns 1 if the circuit is satisfiable and otherwise. What is the 
running time of this procedure when implemented on the RAM? 

c) Describe an efficient (polynomial-time) deterministic procedure that accepts as in- 
put the description of a circuit in SOPE and returns 1 if the circuit is satisfiable 
and otherwise. 

d) By using Boolean algebra, we can convert a circuit from POSE to SOPE. We can 
then use the result of the previous question to determine if the circuit is satisfiable. 
What is the drawback of this approach? 

CENTRAL PROCESSING UNIT 

3.38 Write an assembly-language program to multiply two binary numbers using the sim- 
ple CPU of Section 3.10. How large are the integers that can be multiplied without 
producing numbers that are too large to be recorded in registers? 

3.39 Assume that the simple CPU of Section 3.10 is modified to address an unlimited num- 
ber of memory locations. Show that it can realize any Boolean function by demonstrat- 
ing that it can compute the Boolean operations AND, OR, and NOT. 

3.40 Design a circuit to produce the timing variables $t_j$, $1 \le j \le k$, of the simple CPU.
They must have the property that exactly one of them has value 1 at a time and they
successively become 1.

Hint: Design a circuit that counts sequentially modulo k, an integer. That is, it incre- 
ments a binary number until it reaches k, after which it resets the number to zero. See 
Problem 3.13. 

3.41 Complete the design of the CPU of Section 3.10 by describing circuits for PC, MAR, 
MDR, OPC, INR, and OUTR. 

3.42 Show that an indirect store operation can be simulated by the computer of Section 3.10.

Hint: Construct a program that temporarily moves the value of AC aside, fetches the
address containing the destination for the store, and uses Boolean operations to modify
a STA instruction in the program so that it contains the destination address.

3.43 Write an assembly-language program that repeatedly examines the input register until 
it is nonzero and then moves its contents to the accumulator. 

3.44 Sketch an assembly-language program to emulate a target CPU by a host CPU under 
the assumption that each CPU's instruction set supports indirection. Provide a skeleton 
program that reads an instruction from the target instruction set and decides which host 
instruction to execute. Also sketch the particular host instructions needed to emulate a 
target add instruction and a target jump-on-zero instruction. 



Chapter Notes 



Although the concept of the finite-state machine is fully contained in the Turing machine 
model (Section 3.7) introduced in 1936 [338], the finite-state machine did not become a se- 
rious object of study until the 1950s. Mealy [215] and Moore [223] introduced models for 
finite-state machines that were shown to be equivalent. The Moore model is used in Sec- 
tion 3.1. Rabin and Scott [266] introduced the nondeterministic machine, although not de- 
fined in terms of external choice inputs as it is defined here. 

The simulation of finite-state machines by logic circuits exhibited in Section 3.1.1 is due 
to Savage [285], as is its application to random-access (Section 3.6) and deterministic Tur- 
ing machines (Section 3.9.1) [286]. The design of a simple CPU owes much to the early 
simple computers but is not tied to any particular architecture. The assembly language of 
Section 3.4.3 is borrowed from Smith [312].

The shallow circuits simulating finite-state machines described in Section 3.2 are due to 
Ladner and Fischer [186] and the existence of a universal Turing machine, the topic of Sec- 
tion 3.7, was shown by Turing [338]. 

Cook [74] identified the first NP-complete problem and Karp [159] demonstrated that a 
large number of other problems are NP-complete, including the Traveling Salesperson prob- 
lem. About this time Levin [199] (see also [335]) was led to similar concepts for combinatorial 
problems. Our construction in Section 3.9.1 of a satisfiable circuit follows the general out- 
line given by Papadimitriou [235] (who also gives the reduction to SATISFIABILITY) as well 
as the construction of a circuit simulating a deterministic Turing machine given by Savage 
[286]. Cook also identified the first P-complete problem [75,79]. Ladner [185] observed 
that the circuit of Theorem 3.9.1 could be written by a program using logarithmic space, 
thereby showing that CIRCUIT VALUE is P-complete. More information on P-complete and 
NP-complete problems can be found in Chapter 8. 

The more sophisticated simulation of a circuit by a Turing machine given in Section 3.9.7 
is due to Pippenger and Fischer [252] with improvements by Schnorr [301] and Savage, as 
cited by Schnorr. 



CHAPTER 4

Finite-State Machines and Pushdown Automata



The finite-state machine (FSM) and the pushdown automaton (PDA) enjoy a special place in 
computer science. The FSM has proven to be a very useful model for many practical tasks and 
deserves to be among the tools of every practicing computer scientist. Many simple tasks, such 
as interpreting the commands typed into a keyboard or running a calculator, can be modeled 
by finite-state machines. The PDA is a model to which one appeals when writing compilers 
because it captures the essential architectural features needed to parse context-free languages, 
languages whose structure most closely resembles that of many programming languages. 

In this chapter we examine the language recognition capability of FSMs and PDAs. We
show that FSMs recognize exactly the regular languages, languages defined by regular
expressions and generated by regular grammars. We also provide an algorithm to find an FSM
that is equivalent to a given FSM but has the fewest states.

We examine language recognition by PDAs and show that PDAs recognize exactly the
context-free languages, languages whose grammars satisfy less stringent requirements than
regular grammars. Both regular and context-free grammar types are special cases of the
phrase-structure grammars, which are shown in Chapter 5 to generate exactly the languages
accepted by Turing machines.

It is desirable not only to classify languages by the architecture of machines that recog- 
nize them but also to have tests to show that a language is not of a particular type. For this 
reason we establish so-called pumping lemmas whose purpose is to show how strings in one 
language can be elongated or "pumped up." Pumping up may reveal that a language does not 
fall into a presumed language category. We also develop other properties of languages that 
provide mechanisms for distinguishing among language types. Because of the importance of 
context-free languages, we examine how they are parsed, a key step in programming language 
translation.

4.1 Finite-State Machine Models

The deterministic finite-state machine (DFSM), introduced in Section 3.1, has a set of states, 
including an initial state and one or more final states. At each unit of time a DFSM is given 
a letter from its input alphabet. This causes the machine to move from its current state to a 
potentially new state. While in a state, the DFSM produces a letter from its output alphabet. 
Such a machine computes the function defined by the mapping from strings of input letters 
to strings of output letters. DFSMs can also be used to accept strings. A string is accepted 
by a DFSM if the last state entered by the machine on that input string is a final state. The 
language recognized by a DFSM is the set of strings that it accepts. 

Although there are languages that cannot be accepted by any machine with a finite number 
of states, it is important to note that all realistic computational problems are finite in nature 
and can be solved by FSMs. However, important opportunities to simplify computations may 
be missed if we do not view them as requiring potentially infinite storage, such as that provided 
by pushdown automata, machines that store data on a pushdown stack. (Pushdown automata 
are formally introduced in Section 4.8.) 

The nondeterministic finite-state machine (NFSM) was also introduced in Section 3.1. 
The NFSM has the property that for a given state and input letter there may be several states 
to which it could move. Also for some state and input letter there may be no possible move. We 
say that an NFSM accepts a string if there is a sequence of next-state choices (see Section 3.1.5) 
that can be made, when necessary, so that the string causes the NFSM to enter a final state. 
The language accepted by such a machine is the set of strings it accepts. 

Although nondeterminism is a useful tool in describing languages and computations, non- 
deterministic computations are very expensive to simulate deterministically: the deterministic 
simulation time can grow as an exponential function of the nondeterministic computation 
time. We explore nondeterminism here to gain experience with it. This will be useful in 
Chapter 8 when we classify languages by the ability of nondeterministic machines of infinite 
storage capacity to accept them. However, as we shall see, nondeterminism offers no ad- 
vantage for finite-state machines in that both DFSMs and NFSMs recognize the same set of 
languages. 

We now begin our formal treatment of these machine models. Since this chapter is con- 
cerned only with language recognition, we give an abbreviated definition of the deterministic 
FSM that ignores the output function. We also give a formal definition of the nondeterministic 
finite-state machine that agrees with that given in Section 3.1.5. We recall that we interpreted 
such a machine as a deterministic FSM that possesses a choice input through which a choice 
agent specifies the state transition to take if more than one is possible. 

DEFINITION 4.1.1 A deterministic finite-state machine (DFSM) M is a five-tuple
$M = (\Sigma, Q, \delta, s, F)$ where $\Sigma$ is the input alphabet, $Q$ is the finite set of states,
$\delta : Q \times \Sigma \mapsto Q$ is the next-state function, $s$ is the initial state, and $F$ is the set of
final states. The DFSM M accepts the input string $w \in \Sigma^*$ if the last state entered by M on
application of w starting in state s is a member of the set F. M recognizes the language L(M)
consisting of all such strings.
A nondeterministic FSM (NFSM) is similarly defined except that the next-state function $\delta$
is replaced by a next-set function $\delta : Q \times \Sigma \mapsto 2^Q$ that associates a set of states with each
state-input pair (q, a). The NFSM M accepts the string $w \in \Sigma^*$ if there are next-state choices,
whenever more than one exists, such that the last state entered under the input string w is a
member of F. M accepts the language L(M) consisting of all such strings.



©John E Savage 







Figure 4.1 The deterministic finite-state machine $M_{odd/even}$ that accepts strings containing
an odd number of 0's and an even number of 1's.



Figure 4.1 shows a DFSM $M_{odd/even}$ with initial state $q_0$. The final state is shown as
a shaded circle; that is, $F = \{q_2\}$. $M_{odd/even}$ is in state $q_0$ or $q_2$ as long as the number
of 1's in its input is even and is in state $q_1$ or $q_3$ as long as the number of 1's in its input is
odd. Similarly, $M_{odd/even}$ is in state $q_0$ or $q_1$ as long as the number of 0's in its input is even
and is in state $q_2$ or $q_3$ as long as the number of 0's in its input is odd. Thus, $M_{odd/even}$
recognizes the language of binary strings containing an odd number of 0's and an even number
of 1's.
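
The behavior of this machine can be checked with a short simulation. The sketch below is in Python; the transition table is not taken from Fig. 4.1 (the figure is not reproduced here) but is inferred from the description above, in which each state records the parity of the 0's and of the 1's seen so far. The names DELTA, START, FINALS, and dfsm_accepts are illustrative choices, and the final state is assumed to be q2, the only state consistent with an odd number of 0's and an even number of 1's.

```python
# Transition table inferred from the prose (an assumption, not copied from the figure):
#   q0 = (even 0's, even 1's)   q1 = (even 0's, odd 1's)
#   q2 = (odd 0's,  even 1's)   q3 = (odd 0's,  odd 1's)
DELTA = {
    ("q0", "0"): "q2", ("q0", "1"): "q1",
    ("q1", "0"): "q3", ("q1", "1"): "q0",
    ("q2", "0"): "q0", ("q2", "1"): "q3",
    ("q3", "0"): "q1", ("q3", "1"): "q2",
}
START = "q0"
FINALS = {"q2"}


def dfsm_accepts(w, delta=DELTA, start=START, finals=FINALS):
    """Return True if the last state entered on input w is a final state."""
    state = start
    for letter in w:
        state = delta[(state, letter)]  # the next-state function
    return state in finals
```

For instance, dfsm_accepts("011") holds because the string has one 0 and two 1's, whereas the empty string is rejected because it contains an even number (zero) of 0's.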

When the next-set function $\delta$ for an NFSM has value $\delta(q, a) = \emptyset$, the empty set, for
state-input pair (q, a), no transition is specified from state q on input letter a.

Figure 4.2 shows a simple NFSM $M_{ND}$ with initial state $q_0$ and final state set
$F = \{q_0, q_3, q_5\}$. Nondeterministic transitions are possible from states $q_0$, $q_3$, and $q_5$.
In addition, no transition is specified on input 0 from states $q_1$ and $q_2$ nor on input 1 from
states $q_0$, $q_3$, $q_4$, or $q_5$.



Figure 4.2 The nondeterministic machine $M_{ND}$.



4.2 Equivalence of DFSMs and NFSMs



Finite-state machines recognizing the same language are said to be equivalent. We now show
that the class of languages accepted by DFSMs and NFSMs is the same. That is, for each
NFSM there is an equivalent DFSM and vice versa. The proof has two symmetrical steps: a)
given an arbitrary DFSM $D_1$ recognizing the language $L(D_1)$, we construct an NFSM $N_1$
that accepts $L(D_1)$, and b) given an arbitrary NFSM $N_2$ that accepts $L(N_2)$, we construct a
DFSM $D_2$ that recognizes $L(N_2)$. The first half of this proof follows immediately from the
fact that a DFSM is itself an NFSM. The second half of the proof is a bit more difficult and
is stated below as a theorem. The method of proof is quite simple, however. We construct a
DFSM $D_2$ that has one state for each set of states that the NFSM $N_2$ can reach on some input
string and exhibit a next-state function for $D_2$. We illustrate this approach with the NFSM
$N_2 = M_{ND}$ of Fig. 4.2.



Since the initial state of $M_{ND}$ is $q_0$, the initial state of $D_2 = M_{equiv}$, the DFSM equivalent
to $M_{ND}$, is the set $\{q_0\}$. In turn, because $q_0$ has two successor states on input 0, namely $q_1$ and
$q_2$, we let $\{q_1, q_2\}$ be the successor to $\{q_0\}$ in $M_{equiv}$ on input 0, as shown in the following
table. Since $q_0$ has no successor on input 1, the successor to $\{q_0\}$ on input 1 is the empty set $\emptyset$.
Building in this fashion, we find that the successor to $\{q_1, q_2\}$ on input 1 is $\{q_3, q_4\}$ whereas
its successor on input 0 is $\emptyset$. The reader can complete the table shown below. Here $q_{equiv}$ is
the name of a state of the DFSM $M_{equiv}$.



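$q_{equiv}$              $a$    $\delta_{M_{equiv}}(q_{equiv}, a)$
$\{q_0\}$                0      $\{q_1, q_2\}$
$\{q_0\}$                1      $\emptyset$
$\{q_1, q_2\}$           0      $\emptyset$
$\{q_1, q_2\}$           1      $\{q_3, q_4\}$
$\{q_3, q_4\}$           0      $\{q_1, q_2, q_5\}$
$\{q_3, q_4\}$           1      $\emptyset$
$\{q_1, q_2, q_5\}$      0      $\{q_1, q_2\}$
$\{q_1, q_2, q_5\}$      1      $\{q_3, q_4\}$

$q_{equiv}$              New label
$\{q_0\}$                $a$
$\{q_1, q_2\}$           $b$
$\{q_3, q_4\}$           $c$
$\{q_1, q_2, q_5\}$      $d$
$\emptyset$              $q_R$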



In the second table above, we provide a new label for each state $q_{equiv}$ of $M_{equiv}$. In
Fig. 4.3 we use these new labels to exhibit the DFSM $M_{equiv}$ equivalent to the NFSM $M_{ND}$ of
Fig. 4.2. A final state of $M_{equiv}$ is any set containing a final state of $M_{ND}$ because a string takes
$M_{equiv}$ to such a set if and only if it can take $M_{ND}$ to one of its final states. We now show that
this method of constructing a DFSM from an NFSM always works.
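
The table-building procedure just described can be sketched in Python. The transition relation ND_DELTA below is an assumption: Fig. 4.2 is not reproduced here, so its edges were chosen only to agree with the successor table for $M_{equiv}$ and with the remark that nondeterministic transitions occur at $q_0$, $q_3$, and $q_5$. The function names are illustrative as well.

```python
from collections import deque

# Hypothetical edges for the NFSM of Fig. 4.2, chosen to be consistent with the
# successor table in the text; the actual figure may distribute them differently.
ND_DELTA = {
    ("q0", "0"): {"q1", "q2"},
    ("q1", "1"): {"q3"},
    ("q2", "1"): {"q4"},
    ("q3", "0"): {"q1", "q2"},
    ("q4", "0"): {"q5"},
    ("q5", "0"): {"q1", "q2"},
}
ND_FINALS = {"q0", "q3", "q5"}


def subset_construction(delta, start, finals, alphabet):
    """Build the DFSM whose states are the sets of NFSM states reachable from {start}."""
    start_set = frozenset([start])
    dfsm_delta = {}
    seen = {start_set}
    queue = deque([start_set])
    while queue:
        subset = queue.popleft()
        for a in alphabet:
            # Union of the next-sets of every NFSM state in the current subset.
            successor = frozenset(p for q in subset for p in delta.get((q, a), ()))
            dfsm_delta[(subset, a)] = successor
            if successor not in seen:
                seen.add(successor)
                queue.append(successor)
    # A DFSM state is final when it contains at least one final NFSM state.
    dfsm_finals = {subset for subset in seen if subset & finals}
    return dfsm_delta, start_set, dfsm_finals


def dfsm_accepts(w, dfsm_delta, start_set, dfsm_finals):
    """Run the constructed DFSM on w and report whether it ends in a final state."""
    state = start_set
    for a in w:
        state = dfsm_delta[(state, a)]
    return state in dfsm_finals
```

Under the assumed edges, the construction reaches five subsets, including the empty set, which plays the role of the reject state $q_R$.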

THEOREM 4.2.1 Let L be a language accepted by a nondeterministic finite-state machine $M_1$.
There exists a deterministic finite-state machine $M_2$ that recognizes L.

Proof Let $M_1 = (\Sigma, Q_1, \delta_1, s_1, F_1)$ be an NFSM that accepts the language L. We design
a DFSM $M_2 = (\Sigma, Q_2, \delta_2, s_2, F_2)$ that also recognizes L. $M_1$ and $M_2$ have identical input
alphabets, $\Sigma$. The states of $M_2$ are associated with subsets of the states of $Q_1$, which is
denoted by $Q_2 \subseteq 2^{Q_1}$, where $2^{Q_1}$ is the power set of $Q_1$ containing all the subsets of $Q_1$,
including the empty set. We let the initial state $s_2$ of $M_2$ be associated with the set $\{s_1\}$
containing the initial state of $M_1$. A state of $M_2$ is a set of states that $M_1$ can reach on a
sequence of inputs. A final state of $M_2$ is a subset of $Q_1$ that contains a final state of $M_1$.
For example, if $q_5 \in F_1$, then $\{q_2, q_5\} \in F_2$.

Figure 4.3 The DFSM $M_{equiv}$ equivalent to the NFSM $M_{ND}$.



We first give an inductive definition of the states of $M_2$. Let $Q_2^{(k)}$ denote the sets of states
of $M_1$ that can be reached from $s_1$ on input strings containing k or fewer letters. In the
example given above, $Q_2^{(1)} = \{\{q_0\}, \{q_1, q_2\}, q_R\}$ and
$Q_2^{(3)} = \{\{q_0\}, \{q_1, q_2\}, \{q_3, q_4\}, \{q_1, q_2, q_5\}, q_R\}$. To construct $Q_2^{(k+1)}$ from $Q_2^{(k)}$,
we form the subset of $Q_1$ that can be reached on each input letter from a subset in $Q_2^{(k)}$,
as illustrated above. If this is a new set, it is added to $Q_2^{(k)}$ to form $Q_2^{(k+1)}$. When $Q_2^{(k)}$
and $Q_2^{(k+1)}$ are the same, we terminate this process since no new subsets of $Q_1$ can be
reached from $s_1$. This process eventually terminates because $Q_2$ has at most $2^{|Q_1|}$ elements.
It terminates in at most $2^{|Q_1|} - 1$ steps because starting from the initial set $\{q_0\}$ at least
one new subset must be added at each step.

The next-state function $\delta_2$ of $M_2$ is defined as follows: for each state q of $M_2$ (a subset
of $Q_1$), the value of $\delta_2(q, a)$ for input letter a is the state of $M_2$ (subset of $Q_1$) reached from
q on input a. As the sets $Q_2^{(1)}, \ldots, Q_2^{(m)}$ are constructed, $m \le 2^{|Q_1|} - 1$, we construct a
table for $\delta_2$.

We now show by induction on the length of an input string z that if z can take $M_1$ to
a state in the set $S \subseteq Q_1$, then it takes $M_2$ to its state associated with S. It follows that if S
contains a final state of $M_1$, then z is accepted by both $M_1$ and $M_2$.

The basis for the inductive hypothesis is the case of the empty input letter. In this case,
$s_1$ is reached by $M_1$ if and only if $\{s_1\}$ is reached by $M_2$. The inductive hypothesis is that
if w of length n can take $M_1$ to a state in the set S, then it takes $M_2$ to its state associated
with S. We assume the hypothesis is true on inputs of length n and show that it remains
true on inputs of length n + 1. Let z = wa be an input string of length n + 1. To show
that z can take $M_1$ to a state in S' if and only if it takes $M_2$ to the state associated with S',
observe that by the inductive hypothesis there exists a set $S \subseteq Q_1$ such that w can take $M_1$
to a state in S if and only if it takes $M_2$ to the state associated with S. By the definition
of $\delta_2$, the input letter a takes the states of $M_1$ in S into states of $M_1$ in S' if and only if a
takes the state of $M_2$ associated with S to the state associated with S'. It follows that the
inductive hypothesis holds. ■

Up to this point we have shown equivalence between deterministic and nondeterministic
FSMs. Another equivalence question arises in this context: "Given an FSM, is there an
equivalent FSM that has a smaller number of states?" The determination of an equivalent FSM
with the smallest number of states is called the state minimization problem and is explored
in Section 4.7.



4.3 Regular Expressions 



In this section we introduce regular expressions, algebraic expressions over sets of individual 
letters that describe the class of languages recognized by finite-state machines, as shown in the 
next section. 

Regular expressions are formed through the concatenation, union, and Kleene closure of
sets of strings. Given two sets of strings $L_1$ and $L_2$, their concatenation $L_1 \cdot L_2$ is the set
$\{uv \mid u \in L_1 \text{ and } v \in L_2\}$; that is, the set of strings consisting of an arbitrary string of $L_1$
followed by an arbitrary string of $L_2$. (We often omit the concatenation operator $\cdot$, writing
variables one after the other instead.) The union of $L_1$ and $L_2$, denoted $L_1 \cup L_2$, is the set
of strings that are in $L_1$ or $L_2$ or both. The Kleene closure of a set L of strings, denoted $L^*$
(also called the Kleene star), is defined in terms of the i-fold concatenation of L with itself,
namely, $L^i = L \cdot L^{i-1}$, where $L^0 = \{\epsilon\}$, the set containing the empty string:

$$L^* = \bigcup_{i=0}^{\infty} L^i$$

Thus, $L^*$ is the set of strings formed by concatenating zero or more words of L. Finally, we
define the positive closure of L to be the union of all i-fold products except for the zeroth,
that is,

$$L^+ = \bigcup_{i=1}^{\infty} L^i$$

The positive closure is a useful shorthand in regular expressions. 

An example is helpful. Let $L_1 = \{01, 11\}$ and $L_2 = \{0, aba\}$; then
$L_1 L_2 = \{010, 01aba, 110, 11aba\}$, $L_1 \cup L_2 = \{0, 01, 11, aba\}$, and

$$L_2^* = \{0, aba\}^* = \{\epsilon, 0, aba, 00, 0aba, aba0, abaaba, \ldots\}$$

Note that the definition given earlier for $\Sigma^*$, namely, the set of strings over the finite alphabet
$\Sigma$, coincides with this new definition of the Kleene closure. We are now prepared to define
regular expressions.
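
The three set operations can be tried out directly on the example above. The following Python sketch represents a language as a set of strings; because the Kleene closure is infinite, the helper kleene_upto only forms the union of the i-fold products for i up to a chosen bound k. All function names here are illustrative and are not notation used elsewhere in the text.

```python
def concat(L1, L2):
    """Concatenation: every string of L1 followed by every string of L2."""
    return {u + v for u in L1 for v in L2}


def power(L, i):
    """The i-fold concatenation of L with itself; the 0-fold product is {empty string}."""
    result = {""}
    for _ in range(i):
        result = concat(L, result)
    return result


def kleene_upto(L, k):
    """Finite approximation of the Kleene closure: union of the i-fold products, 0 <= i <= k."""
    out = set()
    for i in range(k + 1):
        out |= power(L, i)
    return out


def positive_upto(L, k):
    """Finite approximation of the positive closure: the same union without the zeroth product."""
    out = set()
    for i in range(1, k + 1):
        out |= power(L, i)
    return out


L1 = {"01", "11"}
L2 = {"0", "aba"}
```

With these definitions, concat(L1, L2) yields the four strings 010, 01aba, 110, and 11aba, and kleene_upto(L2, 2) yields the seven strings listed explicitly in the example, with the empty string standing for $\epsilon$.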

DEFINITION 4.3.1 Regular expressions over the finite alphabet $\Sigma$ and the languages they
describe are defined recursively as follows:

1. $\emptyset$ is a regular expression denoting the empty set.

2. $\epsilon$ is a regular expression denoting the set $\{\epsilon\}$.

3. For each letter $a \in \Sigma$, a is a regular expression denoting the set {a} containing a.

4. If r and s are regular expressions denoting the languages R and S, then (rs), (r + s), and
(r*) are regular expressions denoting the languages $R \cdot S$, $R \cup S$, and $R^*$, respectively.

The languages denoted by regular expressions are called regular languages. (They are also often
called regular sets.)

Figure 4.4 A finite-state machine computing the EXCLUSIVE OR of its inputs.



Some examples of regular expressions will clarify the definitions. The regular expression 
(0 + 1)* denotes the set of all strings over the alphabet {0, 1}. The expression (0*)(1) 
denotes the strings containing zero or more 0's that end with a single 1. The expression
((1)(0*)(1) + 0)* denotes strings containing an even number of 1's. Thus, the expression
((0*)(1))((1)(0*)(1) + 0)* denotes strings containing an odd number of 1's. This is exactly
the class of strings recognized by the simple DFSM in Fig. 4.4. (So far we have set in boldface 
all regular expressions denoting sets containing letters. Since context will distinguish between 
a set containing a letter and the letter itself, we drop the boldface notation at this point.) 

Some parentheses in regular expressions can be omitted if we give highest precedence to 
Kleene closure, next highest precedence to concatenation, and lowest precedence to union. For 
example, we can write ((0*)(1))((1)(0*)(1) + 0)* as 0*1(10*1 + 0)*. 
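
The claim that 0*1(10*1 + 0)* denotes the strings with an odd number of 1's can be spot-checked mechanically. The sketch below uses Python's standard re module, in which union is written | rather than +; it compares the expression against a direct parity count on every binary string up to a chosen length. This is only a finite check under that translation of notation, not a proof, and the helper names are illustrative.

```python
import re
from itertools import product

# The text's 0*1(10*1 + 0)* with union written as "|" in Python's regex syntax.
ODD_ONES = re.compile(r"0*1(10*1|0)*")


def has_odd_ones(w):
    """True if the whole string w is matched by the regular expression."""
    return ODD_ONES.fullmatch(w) is not None


def all_binary_strings(max_len):
    """Generate every string over {0, 1} of length at most max_len."""
    for n in range(max_len + 1):
        for letters in product("01", repeat=n):
            yield "".join(letters)


def agrees_with_parity(max_len):
    """Compare the regular expression with a direct count of 1's."""
    return all(has_odd_ones(w) == (w.count("1") % 2 == 1)
               for w in all_binary_strings(max_len))
```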

Because regular expressions denote languages, certain combinations of union, concatena- 
tion, and Kleene closure operations on regular expressions can be rewritten as other combina- 
tions of operations. A regular expression will be treated as identical to the language it denotes. 
Two regular expressions are equivalent if they denote the same language. We now state 
properties of regular expressions, leaving their proof to the reader. 

THEOREM 4.3.1 Let $\emptyset$ and $\epsilon$ be the regular expressions denoting the empty set and the set
containing the empty string and let r, s, and t be arbitrary regular expressions. Then the rules
shown in Fig. 4.5 hold.

We illustrate these rules with the following example. Let $a = 0^*1 \cdot b + 0^*$, where $b = c \cdot 10^+$
and $c = (0 + 10^+1)^*$. Using rule (16) of Fig. 4.5, we rewrite c as follows:

$$c = (0 + 10^+1)^* = (0^*10^+1)^*0^*$$

Then using rule (15) with $r = 0^*10^+$ and $s = 1$, we write b as follows:

$$b = (0^*10^+1)^*0^*10^+ = (rs)^*r = r(sr)^* = 0^*10^+(10^*10^+)^*$$

It follows that a satisfies

$$
\begin{aligned}
a &= 0^*1 \cdot b + 0^* \\
  &= 0^*10^*10^+(10^*10^+)^* + 0^* \\
  &= 0^*(10^*10^+)^+ + 0^* \\
  &= 0^*((10^*10^+)^+ + \epsilon) \\
  &= 0^*(10^*10^+)^*
\end{aligned}
$$

(1)   $r\emptyset = \emptyset r = \emptyset$
(2)   $r\epsilon = \epsilon r = r$
(3)   $r + \emptyset = \emptyset + r = r$
(4)   $r + r = r$
(5)   $r + s = s + r$
(6)   $r(s + t) = rs + rt$
(7)   $(r + s)t = rt + st$
(8)   $r(st) = (rs)t$
(9)   $\emptyset^* = \epsilon$
(10)  $\epsilon^* = \epsilon$
(11)  $(\epsilon + r)^+ = r^*$
(12)  $(\epsilon + r)^* = r^*$
(13)  $r^*(\epsilon + r) = (\epsilon + r)r^* = r^*$
(14)  $r^*s + s = r^*s$
(15)  $r(sr)^* = (rs)^*r$
(16)  $(r + s)^* = (r^*s)^*r^* = (s^*r)^*s^*$

Figure 4.5 Rules that apply to regular expressions.



where we have simplified the expressions using the definition of the positive closure, namely 
r(r*) = r + in the second equation and rules (6), (5), and (12) in the last three equations. 
Other examples of the use of the identities can be found in Section 4.4. 
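
A simplification of this kind can be sanity-checked by comparing the original and simplified expressions on all short strings. The Python sketch below does this for the example above, with a = 0*1(0 + 10⁺1)*10⁺ + 0* before simplification and 0*(10*10⁺)* afterward. It assumes the standard re module, where union is written | and the postfix + already denotes positive closure. Agreement up to a bounded length is evidence of a correct derivation, not a proof; the names are illustrative.

```python
import re
from itertools import product

# a = 0*1 . b + 0*  with  b = c . 10^+  and  c = (0 + 10^+ 1)*, written in Python syntax.
ORIGINAL = re.compile(r"0*1(0|10+1)*10+|0*")
# The simplified form 0*(10*10^+)* obtained with the rules of Fig. 4.5.
SIMPLIFIED = re.compile(r"0*(10*10+)*")


def same_language_upto(max_len):
    """Check that both expressions match exactly the same binary strings up to max_len."""
    for n in range(max_len + 1):
        for letters in product("01", repeat=n):
            w = "".join(letters)
            in_original = ORIGINAL.fullmatch(w) is not None
            in_simplified = SIMPLIFIED.fullmatch(w) is not None
            if in_original != in_simplified:
                return False
    return True
```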



4.4 Regular Expressions and FSMs 



Regular languages are exactly the languages recognized by finite-state machines, as we now 
show. Our two-part proof begins by showing (Section 4.4.1) that every regular language can 
be accepted by a nondeterministic finite-state machine. This is followed in Section 4.4.2 by 
a proof that the language recognized by an arbitrary deterministic finite-state machine can be 
described by a regular expression. Since by Theorem 4.2.1 the language recognition power of
DFSMs and NFSMs is the same, the desired conclusion follows.



4.4.1 Recognition of Regular Expressions by FSMs 

THEOREM 4.4.1 Given a regular expression r over the set $\Sigma$, there is a nondeterministic
finite-state machine that accepts the language denoted by r.

Proof We show by induction on the size of a regular expression r (the number of its opera- 
tors) that there is an NFSM that accepts the language described by r. 

BASIS: If no operators are used, the regular expression is either $\epsilon$, $\emptyset$, or a for some $a \in \Sigma$.
The finite-state machines shown in Fig. 4.6 recognize these three languages.



Figure 4.6 Finite-state machines recognizing the regular expressions $\epsilon$, $\emptyset$, and a, respectively.
In (b) an output state is shown even though it cannot be reached.



INDUCTION: Assume that the hypothesis holds for all regular expressions r with at most k
operators. We show that it holds for k + 1 operators. Since k is arbitrary, it holds for all k.
The outermost operator (the k + 1st) is either concatenation, union, or Kleene closure. We
argue each case separately.

CASE 1: Let $r = (r_1 \cdot r_2)$. $M_1$ and $M_2$ are the NFSMs that accept $r_1$ and $r_2$, respectively.
By the inductive hypothesis, such machines exist. Without loss of generality, assume that the
states of these machines are distinct and let them have initial states $s_1$ and $s_2$, respectively.
As suggested in Fig. 4.7, create a machine M that accepts r as follows: for each input letter
a, final state f of $M_1$, and state q of $M_2$ reached by an edge from $s_2$ labeled a, add an edge
with the same label a from f to q. If $s_2$ is not a final state of $M_2$, remove the final state
designations from states of $M_1$.

It follows that every string accepted by M either terminates on a final state of $M_1$ (when
$M_2$ accepts the empty string) or exits a final state of $M_1$ (never to return to a state of $M_1$),
enters a state of $M_2$ reachable on one input letter from the initial state of $M_2$, and terminates
on a final state of $M_2$. Thus, M accepts exactly the strings described by r.

Figure 4.7 A machine M recognizing $r_1 \cdot r_2$. $M_1$ and $M_2$ are the NFSMs that accept $r_1$ and
$r_2$, respectively. An edge with label a is added between each final state of $M_1$ and each state of $M_2$
reached on input a from its start state, $s_2$. The final states of $M_2$ are final states of M, as are the
final states of $M_1$ if $s_2$ is a final state of $M_2$. It follows that this machine accepts the strings
beginning with a string in $r_1$ followed by one in $r_2$.

CASE 2: Let $r = (r_1 + r_2)$. Let $M_1$ and $M_2$ be NFSMs with distinct sets of states and
initial states $s_1$ and $s_2$ that accept $r_1$ and $r_2$, respectively. By the inductive hypothesis, $M_1$ and
$M_2$ exist. As suggested in Fig. 4.8, create a machine M that accepts r as follows: a) add a
new initial state $s_0$; b) for each input letter a and state q of $M_1$ or $M_2$ reached by an edge
from $s_1$ or $s_2$ labeled a, add an edge with the same label from $s_0$ to q. If either $s_1$ or $s_2$ is a
final state, make $s_0$ a final state.

It follows that if either $M_1$ or $M_2$ accepts the empty string, so does M. On the first
non-empty input letter M enters and remains in either the states of $M_1$ or those of $M_2$. It
follows that it accepts either the strings accepted by $M_1$ or those accepted by $M_2$ (or both),
that is, the union of $r_1$ and $r_2$.

Figure 4.8 A machine M accepting $r_1 + r_2$. $M_1$ and $M_2$ are the NFSMs that accept $r_1$ and
$r_2$, respectively. The new start state $s_0$ has an edge labeled a for each edge with this label from the
initial state of $M_1$ or $M_2$. The final states of M are the final states of $M_1$ and $M_2$ as well as $s_0$ if
either $s_1$ or $s_2$ is a final state. After the first input choice, the new machine acts like either $M_1$ or
$M_2$. Therefore, it accepts strings denoted by $r_1 + r_2$.

CASE 3: Let $r = (r_1)^*$. Let $M_1$ be an NFSM with initial state $s_1$ that accepts $r_1$, which,
by the inductive hypothesis, exists. Create a new machine M, as suggested in Fig. 4.9, as
follows: a) add a new initial state $s_0$; b) for each input letter a and state q reached on a from
$s_1$, add an edge with label a between $s_0$ and state q, as in Case 2; c) add such
edges from each final state to these same states. Make the new initial state a final state and
remove the initial-state designation from $s_1$.

It follows that M accepts the empty string, as it should since $r = (r_1)^*$ contains the
empty string. Since the edges leaving each final state are those directed away from the initial
state $s_0$, it follows that M accepts strings that are the concatenation of strings in $r_1$, as it
should. ■
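
The three cases of the induction translate almost line by line into code. The Python sketch below is one possible rendering, not the book's own implementation: an NFSM is a dictionary with a start state, a set of final states, and a next-set table, and the helper names (letter, concat, union, star, accepts) and the integer state labels are assumptions made for illustration.

```python
import itertools

_fresh = itertools.count()  # supplies distinct state names, so machines never share states


def letter(a):
    """Basis: a two-state machine accepting the single letter a."""
    s, f = next(_fresh), next(_fresh)
    return {"start": s, "finals": {f}, "delta": {(s, a): {f}}}


def _merge(d1, d2):
    """Combine two next-set tables without aliasing their sets."""
    delta = {key: set(targets) for key, targets in d1.items()}
    for key, targets in d2.items():
        delta.setdefault(key, set()).update(targets)
    return delta


def _copy_start_edges(delta, start, sources):
    """For every edge start --a--> q, add an edge p --a--> q for each p in sources."""
    for (p, a), targets in list(delta.items()):
        if p == start:
            for src in sources:
                delta.setdefault((src, a), set()).update(targets)


def concat(m1, m2):
    """Case 1: final states of m1 inherit the edges leaving the start state of m2."""
    delta = _merge(m1["delta"], m2["delta"])
    _copy_start_edges(delta, m2["start"], m1["finals"])
    finals = set(m2["finals"])
    if m2["start"] in m2["finals"]:  # m2 accepts the empty string
        finals |= m1["finals"]
    return {"start": m1["start"], "finals": finals, "delta": delta}


def union(m1, m2):
    """Case 2: a new start state inherits the edges leaving both old start states."""
    s0 = next(_fresh)
    delta = _merge(m1["delta"], m2["delta"])
    _copy_start_edges(delta, m1["start"], [s0])
    _copy_start_edges(delta, m2["start"], [s0])
    finals = m1["finals"] | m2["finals"]
    if m1["start"] in m1["finals"] or m2["start"] in m2["finals"]:
        finals = finals | {s0}
    return {"start": s0, "finals": finals, "delta": delta}


def star(m1):
    """Case 3: a new final start state; it and every final state restart m1."""
    s0 = next(_fresh)
    delta = _merge(m1["delta"], {})
    _copy_start_edges(delta, m1["start"], [s0, *m1["finals"]])
    return {"start": s0, "finals": m1["finals"] | {s0}, "delta": delta}


def accepts(m, w):
    """Track every state the NFSM could be in; accept if some choice ends in a final state."""
    current = {m["start"]}
    for a in w:
        current = set().union(*(m["delta"].get((q, a), ()) for q in current))
    return bool(current & m["finals"])
```

For example, union(concat(letter("1"), star(letter("0"))), letter("0")) builds a machine for 10* + 0; as in the construction of the proof, the old start states are left in place even when they become inaccessible.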



We now illustrate this construction of an NFSM from a regular expression. Consider the
regular expression r = 10* + 0, which we decompose as $r = (r_1 r_2 + r_3)$ where $r_1 = 1$,
$r_2 = (r_4)^*$, $r_3 = 0$, and $r_4 = 0$. Shown in Fig. 4.10(a) is an NFSM accepting the languages
denoted by the regular expressions $r_3$ and $r_4$, and in (b) is an NFSM accepting $r_1$. Figure 4.11
shows an NFSM accepting the closure of $r_4$ obtained by adding a new initial state (which is
also made a final state) from which is directed a copy of the edge directed away from the initial
state of $M_0$, the machine accepting $r_4$. (The state $s_1$ is marked as inaccessible.) Figure 4.12
shows an NFSM accepting $r_1 r_2$ constructed by concatenating the machine $M_1$
accepting $r_1$ with $M_2$ accepting $r_2$. ($s_1$ is inaccessible.) Figure 4.13 gives an NFSM accepting
the language denoted by $r_1 r_2 + r_3$, designed by forming the union of machines for $r_1 r_2$ and $r_3$.
(States $s_2$ and $s_3$ are inaccessible.) Figure 4.14 shows a DFSM recognizing the same language
as that accepted by the machine in Fig. 4.13. Here we have added a reject state $q_R$ to which all
states move on input letters for which no state transition is defined.

Figure 4.9 A machine M accepts $r_1^*$. $M_1$ accepts $r_1$. Make $s_0$ the initial state of M. For
each input letter a, add an edge labeled a from $s_0$ and each final state of $M_1$ to each state reached on
input a from $s_1$, the initial state of $M_1$. The final states of M are $s_0$ and the final states of $M_1$.
Thus, M accepts $\epsilon$ and all strings formed by the concatenation of strings accepted by $M_1$; that is,
it realizes the closure $r_1^*$.

Figure 4.10 Nondeterministic machines accepting 0 and 1.

Figure 4.11 An NFSM accepting the Kleene closure of {0}.

Figure 4.12 A nondeterministic machine accepting 10*.

Figure 4.13 A nondeterministic machine accepting 10* + 0.

4.4.2 Regular Expressions Describing FSM Languages 

We now give the second part of the proof of equivalence of FSMs and regular expressions. We 
show that every language recognized by a DFSM can be described by a regular expression. We 
illustrate the proof using the DFSM of Fig. 4.3, which is the DFSM given in Fig. 4.15 except 
for a relabeling of states. 

THEOREM 4.4.2 If the language L is recognized by a DFSM $M = (\Sigma, Q, \delta, s, F)$, then L can
be represented by a regular expression.




Figure 4.14 A deterministic machine accepting 10* + 0.



Figure 4.15 The DFSM of Figure 4.3 with a relabeling of states.



Proof Let $Q = \{q_1, q_2, \ldots, q_n\}$ and $F = \{q_{j_1}, q_{j_2}, \ldots, q_{j_p}\}$ be the final states. The
proof idea is the following. For every pair of states $(q_i, q_j)$ of $M$ we construct a regular
expression $r^{(0)}_{i,j}$ denoting the set $R^{(0)}_{i,j}$ containing input letters that take $M$ from $q_i$ to $q_j$
without passing through any other states. If $i = j$, $R^{(0)}_{i,j}$ contains the empty letter $\epsilon$ because
$M$ can move from $q_i$ to $q_i$ without reading an input letter. (These definitions are illustrated
in the table $T^{(0)}$ of Fig. 4.16.) For $k = 1, 2, \ldots, n$ we proceed to define the set $R^{(k)}_{i,j}$ of
strings that take $M$ from $q_i$ to $q_j$ without passing through any state except possibly one in
$Q^{(k)} = \{q_1, q_2, \ldots, q_k\}$. We also associate a regular expression $r^{(k)}_{i,j}$ with the set $R^{(k)}_{i,j}$. Since
$Q^{(n)} = Q$, the input strings that carry $M$ from $s = q_t$, the initial state, to a final state in $F$
are the strings accepted by $M$. They can be described by the following regular expression:

\[ r = r^{(n)}_{t,j_1} + r^{(n)}_{t,j_2} + \cdots + r^{(n)}_{t,j_p} \]

This method of proof provides a dynamic programming algorithm to construct a regular
expression for $L$.



\[
T^{(0)} = \{r^{(0)}_{i,j}\} =
\begin{array}{c|ccccc}
i \backslash j & 1 & 2 & 3 & 4 & 5 \\ \hline
1 & \epsilon & 0 & 1 & \emptyset & \emptyset \\
2 & \emptyset & \epsilon & 0 & 1 & \emptyset \\
3 & \emptyset & \emptyset & \epsilon + 0 + 1 & \emptyset & \emptyset \\
4 & \emptyset & \emptyset & 1 & \epsilon & 0 \\
5 & \emptyset & 0 & \emptyset & 1 & \epsilon
\end{array}
\]

Figure 4.16 The table $T^{(0)}$ containing the regular expressions $\{r^{(0)}_{i,j}\}$ associated with the DFSM
shown in Fig. 4.15.

The set $R^{(0)}_{i,j}$ is formally defined below.

\[
R^{(0)}_{i,j} =
\begin{cases}
\{a \mid \delta(q_i, a) = q_j\} & \text{if } i \neq j \\
\{a \mid \delta(q_i, a) = q_j\} \cup \{\epsilon\} & \text{if } i = j
\end{cases}
\]

Since $R^{(k)}_{i,j}$ is defined as the set of strings that take $M$ from $q_i$ to $q_j$ without passing through
states outside of $Q^{(k)}$, it can be recursively defined as the strings that take $M$ from $q_i$ to
$q_j$ without passing through states outside of $Q^{(k-1)}$ plus those that take $M$ from $q_i$ to $q_k$
without passing through states outside of $Q^{(k-1)}$, followed by strings that take $M$ from
$q_k$ to $q_k$ zero or more times without passing through states outside $Q^{(k-1)}$, followed by
strings that take $M$ from $q_k$ to $q_j$ without passing through states outside of $Q^{(k-1)}$. This is
represented by the formula below and suggested in Fig. 4.17:

\[ R^{(k)}_{i,j} = R^{(k-1)}_{i,j} \cup R^{(k-1)}_{i,k} \cdot \left(R^{(k-1)}_{k,k}\right)^* \cdot R^{(k-1)}_{k,j} \]

It follows by induction on $k$ that $R^{(k)}_{i,j}$ correctly describes the strings that take $M$ from $q_i$ to
$q_j$ without passing through states of index higher than $k$.

We now exhibit the set $\{r^{(k)}_{i,j}\}$ of regular expressions that describe the sets
$\{R^{(k)}_{i,j} \mid 1 \leq i, j, k \leq n\}$ and establish the correspondence by induction. If the set $R^{(0)}_{i,j}$ contains the
letters $x_1, x_2, \ldots, x_l$ (which might include the empty letter $\epsilon$), then we let
$r^{(0)}_{i,j} = x_1 + x_2 + \cdots + x_l$. Assume that $r^{(k-1)}_{i,j}$ correctly describes $R^{(k-1)}_{i,j}$. It follows that the regular expression

\[ r^{(k)}_{i,j} = r^{(k-1)}_{i,j} + r^{(k-1)}_{i,k} \left(r^{(k-1)}_{k,k}\right)^* r^{(k-1)}_{k,j} \qquad (4.1) \]

correctly describes $R^{(k)}_{i,j}$. This concludes the proof.
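The recurrence in (4.1) can be turned into a table-filling loop. The Python sketch below is one possible arrangement and is not taken from the text: it assumes states are numbered 1 to n, uses `None` for the empty set and the empty string for the empty letter, and writes its result in Python `re` syntax (`|` in place of +) so that it can be checked mechanically. The names `union`, `concat`, `star`, `dfsm_to_regex`, and `same_language` are hypothetical, and the transition table is the one assumed here for the five-state DFSM of Fig. 4.15.

```python
import re
from itertools import product


def union(a, b):
    """r + s, with None playing the role of the empty set."""
    if a is None:
        return b
    if b is None or a == b:
        return a
    return f"(?:{a}|{b})"


def concat(a, b):
    """rs; concatenating with the empty set gives the empty set."""
    if a is None or b is None:
        return None
    return a + b


def star(a):
    """r*; the closure of the empty set or of the empty letter is the empty letter."""
    if a is None or a == "":
        return ""
    return f"(?:{a})*"


def dfsm_to_regex(n, delta, start, finals):
    """Pattern for a DFSM with states 1..n, where delta[(q, a)] is the next state."""
    # Base case k = 0: single letters, plus the empty letter on the diagonal.
    r = [[None] * (n + 1) for _ in range(n + 1)]
    for (i, a), j in sorted(delta.items()):
        r[i][j] = union(r[i][j], a)
    for i in range(1, n + 1):
        r[i][i] = union("", r[i][i])
    # Step k: also allow paths that pass through state k.
    for k in range(1, n + 1):
        r = [[None] * (n + 1)] + [
            [None] + [
                union(r[i][j], concat(concat(r[i][k], star(r[k][k])), r[k][j]))
                for j in range(1, n + 1)
            ]
            for i in range(1, n + 1)
        ]
    result = None
    for f in sorted(finals):
        result = union(result, r[start][f])
    return result


def same_language(p1, p2, alphabet="01", max_len=8):
    """Compare two patterns on every string of length at most max_len."""
    return all(
        bool(re.fullmatch(p1, w)) == bool(re.fullmatch(p2, w))
        for n in range(max_len + 1)
        for w in map("".join, product(alphabet, repeat=n))
    )


# Assumed transitions: initial state 1, final states 1, 4 and 5, state 3 a reject state.
DELTA = {(1, "0"): 2, (1, "1"): 3, (2, "0"): 3, (2, "1"): 4, (3, "0"): 3,
         (3, "1"): 3, (4, "0"): 5, (4, "1"): 3, (5, "0"): 2, (5, "1"): 4}
PATTERN = dfsm_to_regex(5, DELTA, 1, {1, 4, 5})
```

The pattern produced this way is not simplified, so it is longer than a hand-derived expression; `same_language(PATTERN, "(?:01|010)*")` compares the two on short strings only and is a sanity check rather than a proof of equivalence.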



Figure 4.17 A recursive decomposition of the set $R^{(k)}_{i,j}$ of strings that cause an FSM to move
from state $q_i$ to $q_j$ without passing through states $q_l$ for $l > k$.

The dynamic programming algorithm given in the above proof is illustrated by the DFSM
in Fig. 4.15. Because this algorithm can produce complex regular expressions even for small
DFSMs, we display almost all of its steps, stopping when it is obvious which results are needed
for the regular expression that describes the strings recognized by the DFSM. For $0 \leq k \leq 5$,
let $T^{(k)}$ denote the table of values of $\{r^{(k)}_{i,j} \mid 1 \leq i, j \leq 5\}$. Table $T^{(0)}$ in Fig. 4.16 describes
the next-state function of this DFSM. The remaining tables are constructed by invoking the
definition of $r^{(k)}_{i,j}$ in (4.1). Entries in table $T^{(1)}$ are formed using the following facts:

\[ r^{(1)}_{i,j} = r^{(0)}_{i,j} + r^{(0)}_{i,1} \left(r^{(0)}_{1,1}\right)^* r^{(0)}_{1,j}; \qquad \left(r^{(0)}_{1,1}\right)^* = \epsilon^* = \epsilon; \qquad r^{(0)}_{i,1} = \emptyset \ \text{for } i \geq 2 \]

It follows that $r^{(1)}_{i,j} = r^{(0)}_{i,j}$, or that $T^{(1)}$ is identical to $T^{(0)}$. Invoking the identity
$r^{(2)}_{i,j} = r^{(1)}_{i,j} + r^{(1)}_{i,2} (r^{(1)}_{2,2})^* r^{(1)}_{2,j}$ and using $(r^{(1)}_{2,2})^* = \epsilon$, we construct the table $T^{(2)}$ below:

\[
T^{(2)} = \{r^{(2)}_{i,j}\} =
\begin{array}{c|ccccc}
i \backslash j & 1 & 2 & 3 & 4 & 5 \\ \hline
1 & \epsilon & 0 & 1 + 00 & 01 & \emptyset \\
2 & \emptyset & \epsilon & 0 & 1 & \emptyset \\
3 & \emptyset & \emptyset & \epsilon + 0 + 1 & \emptyset & \emptyset \\
4 & \emptyset & \emptyset & 1 & \epsilon & 0 \\
5 & \emptyset & 0 & 00 & 1 + 01 & \epsilon
\end{array}
\]

The fourth table $T^{(3)}$ is shown below. It is constructed using the identity
$r^{(3)}_{i,j} = r^{(2)}_{i,j} + r^{(2)}_{i,3} (r^{(2)}_{3,3})^* r^{(2)}_{3,j}$ and the fact that $(r^{(2)}_{3,3})^* = (0+1)^*$.

\[
T^{(3)} = \{r^{(3)}_{i,j}\} =
\begin{array}{c|ccccc}
i \backslash j & 1 & 2 & 3 & 4 & 5 \\ \hline
1 & \epsilon & 0 & (1+00)(0+1)^* & 01 & \emptyset \\
2 & \emptyset & \epsilon & 0(0+1)^* & 1 & \emptyset \\
3 & \emptyset & \emptyset & (0+1)^* & \emptyset & \emptyset \\
4 & \emptyset & \emptyset & 1(0+1)^* & \epsilon & 0 \\
5 & \emptyset & 0 & 00(0+1)^* & 1 + 01 & \epsilon
\end{array}
\]



The fifth table $T^{(4)}$ is shown below. It is constructed using the identity
$r^{(4)}_{i,j} = r^{(3)}_{i,j} + r^{(3)}_{i,4} (r^{(3)}_{4,4})^* r^{(3)}_{4,j}$ and the fact that $(r^{(3)}_{4,4})^* = \epsilon$.

\[
T^{(4)} = \{r^{(4)}_{i,j}\} =
\begin{array}{c|ccccc}
i \backslash j & 1 & 2 & 3 & 4 & 5 \\ \hline
1 & \epsilon & 0 & (1+00+011)(0+1)^* & 01 & 010 \\
2 & \emptyset & \epsilon & (0+11)(0+1)^* & 1 & 10 \\
3 & \emptyset & \emptyset & (0+1)^* & \emptyset & \emptyset \\
4 & \emptyset & \emptyset & 1(0+1)^* & \epsilon & 0 \\
5 & \emptyset & 0 & (00+11+011)(0+1)^* & 1 + 01 & \epsilon + 10 + 010
\end{array}
\]

Instead of building the sixth table, $T^{(5)}$, we observe that the regular expression that is
needed is $r = r^{(5)}_{1,1} + r^{(5)}_{1,4} + r^{(5)}_{1,5}$. Since $r^{(5)}_{i,j} = r^{(4)}_{i,j} + r^{(4)}_{i,5} (r^{(4)}_{5,5})^* r^{(4)}_{5,j}$ and
$(r^{(4)}_{5,5})^* = (10 + 010)^*$, we have the following expressions for $r^{(5)}_{1,1}$, $r^{(5)}_{1,4}$, and $r^{(5)}_{1,5}$:

\[ r^{(5)}_{1,1} = \epsilon \]
\[ r^{(5)}_{1,4} = 01 + (010)(10 + 010)^*(1 + 01) \]
\[ r^{(5)}_{1,5} = 010 + (010)(10 + 010)^*(\epsilon + 10 + 010) = (010)(10 + 010)^* \]

Thus, the DFSM recognizes the language denoted by the regular expression $r = \epsilon + 01 +
(010)(10 + 010)^*(\epsilon + 1 + 01)$. It can be shown that this expression denotes the same language
as does $\epsilon + 01 + (01)(01 + 001)^*(\epsilon + 0) = (01 + 010)^*$. (See Problem 4.12.)

4.4.3 grep — Searching for Strings in Files 

Many operating systems provide a command to find strings in files. For example, the Unix
grep command prints all lines of a file containing a string specified by a regular expression.
grep is invoked as follows:

grep regular-expression file_name

Thus, the command grep 'o+' file_name returns each line of the file file_name that
contains o+ somewhere in the line. grep is typically implemented with a nondeterministic
algorithm whose behavior can be understood by considering the construction of the preceding
section.

In Section 4.4.1 we describe a procedure to construct NFSMs accepting strings denoted 
by regular expressions. Each such machine starts in its initial state before processing an input 
string. Since grep finds lines containing a string that starts anywhere in the lines, these NFSMs 
have to be modified to implement grep. The modifications required for this purpose are 
straightforward and left as an exercise for the reader. (See Problem 4.19.) 
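One way to picture such a modification is to simulate the NFSM on sets of states and to put the initial state back into the current set after every letter, so that a match may begin at any position. The Python sketch below illustrates that idea only; it is not claimed to be the answer intended by Problem 4.19 or the way any real grep is implemented. The names `contains_match` and `grep`, and the small machine `DELTA_010`, are hypothetical.

```python
def contains_match(delta, start, finals, line):
    """True if some substring of `line` is accepted by the NFSM.

    delta maps (state, letter) to a set of next states; missing entries
    mean that no transition is defined.
    """
    if start in finals:  # the empty string is accepted
        return True
    current = {start}
    for letter in line:
        following = set()
        for q in current:
            following |= delta.get((q, letter), set())
        if following & finals:
            return True
        # Restart: a match may also begin at the next position.
        current = following | {start}
    return False


def grep(delta, start, finals, lines):
    """Return the lines that contain a match, in their original order."""
    return [ln for ln in lines if contains_match(delta, start, finals, ln)]


# Hypothetical NFSM accepting exactly the string 010.
DELTA_010 = {(0, "0"): {1}, (1, "1"): {2}, (2, "0"): {3}}
```

For example, `grep(DELTA_010, 0, {3}, ["1101011", "0110"])` keeps only the first line, because 010 occurs inside it.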



4.5 The Pumping Lemma for FSMs 



It is not surprising that some languages are not regular. In this section we provide machinery 
to show this. It is given in the form of the pumping lemma, which demonstrates that if a 
regular language contains long strings, it must contain an infinite set of strings of a particular 
form. We show the existence of languages that do not contain strings of this form, thereby 
demonstrating that they are not regular. 

The pigeonhole principle is used to prove the pumping lemma. It states that if there are 
n pigeonholes and n + 1 pigeons, each of which occupies a hole, then at least one hole has two 
pigeons. This principle, whose proof is obvious (see Section 1.3), enjoys a hallowed place in 
combinatorial mathematics. 

The pigeonhole principle is applied as follows. We first note that if a regular language $L$
is infinite, it contains a string $w$ with at least as many letters as there are states in a DFSM $M$
recognizing $L$. Including the initial state, it follows that $M$ visits at least one more state while
processing $w$ than it has different states. Thus, at least one state is visited at least twice. The
substring of $w$ that causes $M$ to move from this state back to itself can be repeated zero or
more times to give other strings in the language. We use the notation $u^n$ to mean the string $u$
repeated $n$ times and let $u^0 = \epsilon$.

LEMMA 4.5.1 Let $L$ be a regular language over the alphabet $\Sigma$ recognized by a DFSM with $m$
states. If $w \in L$ and $|w| \geq m$, then there are strings $r$, $s$, and $t$ with $|s| \geq 1$ and $|rs| \leq m$
such that $w = rst$ and for all integers $n \geq 0$, $rs^nt$ is also in $L$.

Proof Let $L$ be recognized by the DFSM $M$ with $m$ states. Let $k = |w| \geq m$ be the length
of $w$ in $L$. Let $q_0, q_1, q_2, \ldots, q_k$ denote the initial and $k$ successive states that $M$ enters after
receiving each of the letters in $w$. By the pigeonhole principle, some state $q'$ in the sequence
$q_0, \ldots, q_m$ ($m \leq k$) is repeated. Let $q_i = q_j = q'$ for $i < j$. Let $r = w_1 \ldots w_i$ be the
string that takes $M$ from $q_0$ to $q_i = q'$ (this string may be empty) and let $s = w_{i+1} \ldots w_j$
be the string that takes $M$ from $q_i = q'$ to $q_j = q'$ (this string is non-empty). It follows
that $|rs| \leq m$. Finally, let $t = w_{j+1} \ldots w_k$ be the string that takes $M$ from $q_j$ to $q_k$. Since
$s$ takes $M$ from state $q'$ to state $q'$, the final state entered by $M$ is the same whether $s$ is
deleted or repeated one or more times. (See Fig. 4.18.) It follows that $rs^nt$ is in $L$ for all
$n \geq 0$. ■
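The proof is constructive: run the machine on $w$ and stop at the first state that is entered a second time. The Python sketch below follows that recipe. The function names `run` and `pump_split` are hypothetical, and the transition table is the one assumed here for the five-state DFSM recognizing $(01 + 010)^*$ (initial state 1, final states 1, 4 and 5).

```python
def run(delta, start, w):
    """State reached from `start` after reading the string w."""
    q = start
    for a in w:
        q = delta[(q, a)]
    return q


def pump_split(delta, start, w):
    """Return (r, s, t) with w = rst, where s carries the machine from the
    first repeated state back to itself; None if no state repeats."""
    first_seen = {start: 0}  # state -> number of letters read when first entered
    q = start
    for pos, a in enumerate(w, start=1):
        q = delta[(q, a)]
        if q in first_seen:
            i = first_seen[q]
            return w[:i], w[i:pos], w[pos:]
        first_seen[q] = pos
    return None


# Assumed transition table; state 3 is a reject state.
DELTA = {(1, "0"): 2, (1, "1"): 3, (2, "0"): 3, (2, "1"): 4, (3, "0"): 3,
         (3, "1"): 3, (4, "0"): 5, (4, "1"): 3, (5, "0"): 2, (5, "1"): 4}
FINALS = {1, 4, 5}
```

With these assumptions, `pump_split(DELTA, 1, "01001")` splits the string as r = 0, s = 100, t = 1, and each pumped string 0(100)^n 1 again ends in a final state.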

As an application of the pumping lemma, consider the language $L = \{0^p 1^p \mid p \geq 1\}$.
We show that it is not regular. Assume it is regular and is recognized by a DFSM with $m$
states. We show that a contradiction results. Since $L$ is infinite, it contains a string $w$ of length
$k = 2p \geq 2m$, that is, with $p \geq m$. By Lemma 4.5.1 $L$ also contains $rs^nt$, $n \geq 0$, where
$w = rst$ and $|rs| \leq m \leq p$. That is, $s = 0^d$ where $1 \leq d \leq p$. Since $rs^nt = 0^{p+(n-1)d} 1^p$ for
$n \geq 0$ and this is not of the form $0^p 1^p$ for $n = 0$ and $n \geq 2$, the language is not regular.
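The argument can be checked by brute force for small m: for the string consisting of m 0s followed by m 1s, no split allowed by the lemma survives pumping. The Python sketch below is only a finite sanity check of that claim, not a replacement for the proof; the names `in_language` and `pumpable_splits` are hypothetical, and only the values n = 0, ..., max_n are tried.

```python
def in_language(w):
    """Membership in L = {0^p 1^p | p >= 1}."""
    p = len(w) // 2
    return p >= 1 and w == "0" * p + "1" * p


def pumpable_splits(w, m, max_n=3):
    """All (r, s, t) with w = rst, |s| >= 1 and |rs| <= m for which
    r s^n t stays in L for every n in 0..max_n."""
    good = []
    for i in range(0, m):
        for j in range(i + 1, m + 1):
            r, s, t = w[:i], w[i:j], w[j:]
            if all(in_language(r + s * n + t) for n in range(max_n + 1)):
                good.append((r, s, t))
    return good
```

Calling `pumpable_splits("0" * m + "1" * m, m)` returns an empty list for each small m tried, which matches the conclusion that no DFSM with m states can recognize the language.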

The pumping lemma allows us to derive specific conditions under which a language is 
finite or infinite, as we now show. 

LEMMA 4.5.2 Let $L$ be a regular language recognized by a DFSM with $m$ states. $L$ is non-empty
if and only if it contains a string of length less than $m$. It is infinite if and only if it contains a string
of length at least $m$ and at most $2m - 1$.

Proof If $L$ contains a string of length less than $m$, it is not empty. If it is not empty, let $w$
be a shortest string in $L$. This string must have length at most $m - 1$ or we can apply the
pumping lemma to it and find another string of smaller length that is also in $L$. But this
would contradict the assumption that $w$ is a shortest string in $L$. Thus, $L$ contains a string
of length at most $m - 1$.

If $L$ contains a string $w$ of length $m \leq |w| \leq 2m - 1$, as shown in the proof of the
pumping lemma, $w$ can be "pumped up" to produce an infinite set of strings. Suppose now
that $L$ is infinite. Either it contains a string $w$ of length $m \leq |w| \leq 2m - 1$ or it does not.
In the first case, we are done. In the second case, let $w$ be a shortest string in $L$ of length
at least $2m$; one exists because $L$ is infinite. Applying the pumping lemma with $n = 0$
removes a substring $s$ with $1 \leq |s| \leq m$, giving a shorter string in $L$ whose length is still
at least $m$ and hence, in this case, at least $2m$. This contradicts the choice of $w$ as a
shortest string of length greater than or equal to $2m$. ■

Figure 4.18 Diagram illustrating the pumping lemma.
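Lemma 4.5.2 gives finite tests for emptiness and infiniteness: only strings in a bounded range of lengths need to be tried. The Python sketch below applies the lemma literally by enumerating those strings. It is an illustration under assumptions (a DFSM given as a total transition dictionary, with m its number of states), and the names `accepts`, `some_accepted`, `is_nonempty`, and `is_infinite` are hypothetical.

```python
from itertools import product


def accepts(delta, start, finals, w):
    """Run the DFSM on the sequence of letters w."""
    q = start
    for a in w:
        q = delta[(q, a)]
    return q in finals


def some_accepted(alphabet, delta, start, finals, lo, hi):
    """True if some string w with lo <= |w| <= hi is accepted."""
    return any(
        accepts(delta, start, finals, w)
        for n in range(lo, hi + 1)
        for w in product(alphabet, repeat=n)
    )


def is_nonempty(m, alphabet, delta, start, finals):
    """Lemma 4.5.2: look for an accepted string of length less than m."""
    return some_accepted(alphabet, delta, start, finals, 0, m - 1)


def is_infinite(m, alphabet, delta, start, finals):
    """Lemma 4.5.2: look for an accepted string of length m to 2m - 1."""
    return some_accepted(alphabet, delta, start, finals, m, 2 * m - 1)


# Hypothetical three-state DFSM accepting only the string 0 ("d" is a reject state).
ONLY_ZERO = {("s", "0"): "f", ("s", "1"): "d", ("f", "0"): "d",
             ("f", "1"): "d", ("d", "0"): "d", ("d", "1"): "d"}
# Hypothetical two-state DFSM accepting strings with an even number of 1s.
EVEN_ONES = {("e", "0"): "e", ("e", "1"): "o", ("o", "0"): "o", ("o", "1"): "e"}
```

The enumeration grows exponentially with m, so this is a statement about decidability rather than an efficient procedure.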

4.6 Properties of Regular Languages 

Section 4.4 established the equivalence of regular languages (recognized by finite-state ma- 
chines) and the languages denoted by regular expressions. We now present properties satisfied 
by regular languages. We say that a class of languages is closed under an operation if ap- 
plying that operation to a language (or languages) in the class produces another language in 
the class. For example, as shown below, the union of two regular languages is another regular 
language. Similarly, the Kleene closure applied to a regular language returns another regular 
language. 

Given a language $L$ over an alphabet $\Sigma$, the complement of $L$ is the set $\overline{L} = \Sigma^* - L$,
the strings that are in $\Sigma^*$ but not in $L$. (This is also called the difference between $\Sigma^*$ and $L$.)
The intersection of two languages $L_1$ and $L_2$, denoted $L_1 \cap L_2$, is the set of strings that are
in both languages.

THEOREM 4.6. 1 The class of regular languages is closed under the following operations: 

• concatenation 

• union 

• Kleene closure 

• complementation 

• intersection 

Proof In Section 4.4 we showed that the languages denoted by regular expressions are ex- 
actly the languages recognized by finite-state machines (deterministic or nondeterministic). 
Since regular expressions are defined in terms of concatenation, union, and Kleene closure, 
they are closed under each of these operations. 

The proof of closure of regular languages under complementation is straightforward. If
$L$ is regular and has an associated DFSM $M$ that recognizes it, make all final states of $M$ non-
final and all non-final states final. This new machine then recognizes exactly the complement
of $L$. Thus, $\overline{L}$ is also regular.

The proof of closure of regular languages under intersection follows by noting that if $L_1$
and $L_2$ are regular languages, then

\[ L_1 \cap L_2 = \overline{\overline{L_1} \cup \overline{L_2}} \]

that is, the intersection of two sets can be obtained by complementing the union of their
complements. Since each of $\overline{L_1}$ and $\overline{L_2}$ is regular, as is their union, it follows that $\overline{L_1} \cup \overline{L_2}$
is regular. (See Fig. 4.19(a).) Finally, the complement of a regular set is regular. ■
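The two constructions in this proof can be sketched in a few lines of Python. Complementation swaps final and non-final states of a DFSM. For the union step the sketch uses a product machine that runs both DFSMs side by side; this is a substitution made here for brevity, not the route through NFSMs and Theorem 4.2.1 that the text's earlier constructions would give. A DFSM is assumed to be a tuple (states, alphabet, delta, start, finals) with a total transition function, and all names are hypothetical.

```python
from itertools import product


def complement(M):
    """Swap final and non-final states."""
    states, alphabet, delta, start, finals = M
    return (states, alphabet, delta, start, states - finals)


def union(M1, M2):
    """Product machine: run both DFSMs and accept if either accepts."""
    Q1, alphabet, d1, s1, F1 = M1
    Q2, _, d2, s2, F2 = M2
    states = frozenset(product(Q1, Q2))
    delta = {((p, q), a): (d1[(p, a)], d2[(q, a)])
             for (p, q) in states for a in alphabet}
    finals = frozenset((p, q) for (p, q) in states if p in F1 or q in F2)
    return (states, alphabet, delta, (s1, s2), finals)


def intersection(M1, M2):
    """Complement of the union of the complements."""
    return complement(union(complement(M1), complement(M2)))


def accepts(M, w):
    _, _, delta, q, finals = M
    for a in w:
        q = delta[(q, a)]
    return q in finals


# Hypothetical example machines over the alphabet {0, 1}.
EVEN_ONES = (frozenset({"e", "o"}), "01",
             {("e", "0"): "e", ("e", "1"): "o", ("o", "0"): "o", ("o", "1"): "e"},
             "e", frozenset({"e"}))
ENDS_IN_0 = (frozenset({"n", "z"}), "01",
             {("n", "0"): "z", ("n", "1"): "n", ("z", "0"): "z", ("z", "1"): "n"},
             "n", frozenset({"z"}))
BOTH = intersection(EVEN_ONES, ENDS_IN_0)
```

Here `BOTH` accepts exactly the strings that have an even number of 1s and end in 0.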

When we come to study Turing machines in Chapter 5, we will show that there are well-
defined languages that have no machine to recognize them, even if the machine has an infinite
amount of storage available. Thus, it is interesting to ask if there are algorithms that solve
certain decision problems about regular languages in a finite number of steps. (Machines that
halt on all input are said to implement algorithms.) As shown above, there are algorithms
that can recognize the concatenation, union and Kleene closure of regular languages. We now
show that algorithms exist for a number of decision problems concerning finite-state machines.

Figure 4.19 (a) The intersection $L_1 \cap L_2$ of two sets $L_1$ and $L_2$ can be obtained by taking the
complement $\overline{\overline{L_1} \cup \overline{L_2}}$ of the union $\overline{L_1} \cup \overline{L_2}$ of their complements. (b) If $L(M_1) \subseteq L(M_2)$, then
$L(M_1) \cap \overline{L(M_2)} = \emptyset$.

THEOREM 4.6.2 There are algorithms for each of the following decision problems:

a) For a finite-state machine $M$ and a string $w$, determine if $w \in L(M)$.

b) For a finite-state machine $M$, determine if $L(M) = \emptyset$.

c) For a finite-state machine $M$, determine if $L(M) = \Sigma^*$.

d) For finite-state machines $M_1$ and $M_2$, determine if $L(M_1) \subseteq L(M_2)$.

e) For finite-state machines $M_1$ and $M_2$, determine if $L(M_1) = L(M_2)$.

Proof To answer (a) it suffices to supply $w$ to a deterministic finite-state machine equiva-
lent to $M$ and observe the final state after it has processed all letters in $w$. The number of
steps executed by this machine is the length of $w$. Question (b) is answered in Lemma 4.5.2.
We need only determine if the language contains strings of length less than $m$, where $m$ is
the number of states of $M$. This can be done by trying all inputs of length less than $m$.
The answer to question (c) is the same as the answer to "Is $\overline{L(M)} = \emptyset$?" The answer to
question (d) is the same as the answer to "Is $L(M_1) \cap \overline{L(M_2)} = \emptyset$?" (See Fig. 4.19(b).)
Since FSMs that recognize the complement and intersection of regular languages can be
constructed in a finite number of steps (see the proof of Theorem 4.6.1), we can use the
procedure for (b) to answer the question. Finally, the answer to question (e) is "yes" if and
only if $L(M_1) \subseteq L(M_2)$ and $L(M_2) \subseteq L(M_1)$. ■
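Questions (d) and (e) can be sketched directly for DFSMs. The Python sketch below searches the pairs of states that the two machines can reach together and looks for a pair in which the first machine accepts and the second does not. This reachability search is a substitution made here for the "try all inputs of length less than m" procedure of the proof; it decides the same emptiness question. Each DFSM is assumed to be a triple (delta, start, finals) with a total transition function, and the names are hypothetical.

```python
from collections import deque


def is_subset(alphabet, M1, M2):
    """Decide whether L(M1) is contained in L(M2)."""
    d1, s1, F1 = M1
    d2, s2, F2 = M2
    seen = {(s1, s2)}
    queue = deque(seen)
    while queue:
        p, q = queue.popleft()
        if p in F1 and q not in F2:
            return False  # some string is in L(M1) but not in L(M2)
        for a in alphabet:
            nxt = (d1[(p, a)], d2[(q, a)])
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True


def is_equal(alphabet, M1, M2):
    """L(M1) = L(M2) if and only if each is contained in the other."""
    return is_subset(alphabet, M1, M2) and is_subset(alphabet, M2, M1)


# Hypothetical example machines over {0, 1}.
EVEN_ONES = ({("e", "0"): "e", ("e", "1"): "o", ("o", "0"): "o", ("o", "1"): "e"},
             "e", {"e"})
EVERYTHING = ({("u", "0"): "u", ("u", "1"): "u"}, "u", {"u"})
# Counts 1s modulo 4 and accepts on 0 or 2, so it recognizes the same language as EVEN_ONES.
MOD_FOUR = ({(i, a): (i + 1) % 4 if a == "1" else i for i in range(4) for a in "01"},
            0, {0, 2})
```

Question (c) fits the same pattern: $L(M) = \Sigma^*$ exactly when `is_subset(alphabet, EVERYTHING, M)` holds.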



4.7 State Minimization* 

Given a finite-state machine $M$, it is often useful to have a potentially different DFSM $M_{\min}$
with the smallest number of states (a minimal-state machine) that recognizes the same language
$L(M)$. In this section we develop a procedure to find such a machine recognizing a regular
language $L$. As a step in this direction, we define a natural equivalence relation $R_L$ for each lan-
guage $L$ and show that $L$ is regular if and only if $R_L$ has a finite number of equivalence classes.

4.7.1 Equivalence Relations on Languages and States

The relation $R_L$ is used to define a machine $M_L$. When $L$ is regular, we show that $M_L$ is a
minimal-state DFSM. We also give an explicit procedure to construct a minimal-state DFSM
recognizing a regular language $L$. The approach is the following: a) given a regular expression,
an NFSM is constructed (Theorem 4.4.1); b) an equivalent DFSM is then produced (Theo-
rem 4.2.1); c) equivalent states of this DFSM are discovered and coalesced, thereby producing
the minimal machine. We begin our treatment with a discussion of equivalence relations.

DEFINITION 4.7.1 An equivalence relation $R$ on a set $A$ is a partition of the elements of $A$ into
disjoint subsets called equivalence classes. If two elements $a$ and $b$ are in the same equivalence
class under relation $R$, we write $aRb$. If $a$ is an element of an equivalence class, we represent its
equivalence class by $[a]$. An equivalence relation is represented by its equivalence classes.

An example of an equivalence relation on the set $A = \{0, 1, 2, 3\}$ is the set of equivalence
classes $\{\{0, 2\}, \{1, 3\}\}$. Then, $[0]$ and $[2]$ denote the same equivalence class, namely $\{0, 2\}$,
whereas $[1]$ and $[2]$ denote different equivalence classes.

Equivalence relations can be defined on any set, including the set of strings over a finite
alphabet (a language). For example, let the partition $\{0^*, 0(0^*10^*)^+, 1(0 + 1)^*\}$ of the
set $(0 + 1)^*$ denote the equivalence relation $R$. The equivalence classes consist of strings
containing zero or more 0's, strings starting with 0 and containing at least one 1, and strings
beginning with 1. It follows that $00R000$ and $1001R11$ but not that $10R01$.

Additional conditions can be put on equivalence relations on languages. An important 
restriction is that an equivalence relation be right-invariant (with respect to concatenation). 

DEFINITION 4.7.2 An equivalence relation $R$ over the alphabet $\Sigma$ is right-invariant (with respect
to concatenation) if for all $u$ and $v$ in $\Sigma^*$, $uRv$ implies $uzRvz$ for all $z \in \Sigma^*$.

For example, let $R = \{(10^*1 + 0)^*, 0^*1(10^*1 + 0)^*\}$. That is, $R$ consists of two equiv-
alence classes, the set containing strings with an even number of 1s and the set containing
strings with an odd number of 1s. $R$ is right-invariant because if $uRv$, that is, if the numbers
of 1s in $u$ and $v$ are both even or both odd, then the same is true of $uz$ and $vz$ for each
$z \in \Sigma^*$, that is, $uzRvz$.

To each language $L$, whether regular or not, we associate the natural equivalence relation
$R_L$ defined below. Problem 4.30 shows that for some languages $R_L$ has an unbounded number
of equivalence classes.

DEFINITION 4.7.3 Given a language $L$ over $\Sigma$, the equivalence relation $R_L$ is defined as follows:
strings $u, v \in \Sigma^*$ are equivalent, that is, $u R_L v$, if and only if for each $z \in \Sigma^*$, either both $uz$
and $vz$ are in $L$ or both are not in $L$.

The equivalence relation $R = \{(10^*1+0)^*, 0^*1(10^*1+0)^*\}$ given above is the equivalence
relation $R_L$ for both the language $L = (10^*1 + 0)^*$ and the language $L = 0^*1(10^*1 + 0)^*$.
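Definition 4.7.3 quantifies over every suffix $z$, so it cannot be checked exhaustively, but a bounded search for a suffix that separates two strings is easy to write. The Python sketch below returns such a suffix when one of length at most `max_len` exists. A result of `None` shows only that no short separating suffix was found; it is not a proof that the strings are equivalent. The names `distinguishing_suffix`, `even_ones`, and `zeros_then_ones` are hypothetical.

```python
from itertools import product


def distinguishing_suffix(in_language, u, v, alphabet, max_len):
    """Shortest z with |z| <= max_len such that exactly one of uz, vz
    is in the language, or None if no such z is found."""
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            z = "".join(letters)
            if in_language(u + z) != in_language(v + z):
                return z
    return None


def even_ones(w):
    """Membership in (10*1 + 0)*: an even number of 1s."""
    return w.count("1") % 2 == 0


def zeros_then_ones(w):
    """Membership in {0^p 1^p | p >= 1}."""
    p = len(w) // 2
    return p >= 1 and w == "0" * p + "1" * p
```

For the language with an even number of 1s, the strings 1 and 11 are separated by the empty suffix, while no suffix separates 1 from 100. For the language of p 0s followed by p 1s, the strings 0 and 00 are separated by the suffix 1, and the same idea separates any two different runs of 0s, which is one way to see why its relation has unboundedly many classes.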

A natural right-invariant equivalence relation on strings can also be associated with each
DFSM, as shown below. This relation defines two strings as equivalent if they carry the ma-
chine from its initial state to the same state. Thus, for each state there is an equivalence class
of strings that take the machine to that state. For this purpose we extend the state transition
function $\delta$ to strings $\sigma \in \Sigma^*$ recursively by $\delta(q, \epsilon) = q$ and $\delta(q, \sigma a) = \delta(\delta(q, \sigma), a)$ for
$a \in \Sigma$.

DEFINITION 4.7.4 Given a DFSM $M = (\Sigma, Q, \delta, s, F)$, $R_M$ is the equivalence relation defined
as follows: for all $u, v \in \Sigma^*$, $u R_M v$ if and only if $\delta(s, u) = \delta(s, v)$. (Note that $\delta(q, \epsilon) = q$.)
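The extended transition function and the classes of $R_M$ can be illustrated on short strings. The Python sketch below implements the recursive definition literally and then groups all strings up to a given length by the state they reach. The names `delta_star` and `classes_of_R_M` and the two-state example machine are hypothetical.

```python
from itertools import product


def delta_star(delta, q, w):
    """Extended transition function: the state reached from q on string w."""
    if w == "":
        return q
    # delta(q, sigma a) = delta(delta(q, sigma), a)
    return delta[(delta_star(delta, q, w[:-1]), w[-1])]


def classes_of_R_M(delta, start, alphabet, max_len):
    """Group the strings of length <= max_len by the state they lead to."""
    classes = {}
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            w = "".join(letters)
            classes.setdefault(delta_star(delta, start, w), []).append(w)
    return classes


# Hypothetical DFSM that tracks whether the number of 1s read so far is even or odd.
PARITY = {("even", "0"): "even", ("even", "1"): "odd",
          ("odd", "0"): "odd", ("odd", "1"): "even"}
```

Each key of the returned dictionary is a state and each value is the finite part of the corresponding equivalence class of $R_M$.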




It is straightforward to show that the equivalence relations $R_L$ and $R_M$ are right-invariant.
(See Problems 4.28 and 4.29.) It is also clear that $R_M$ has as many equivalence classes as there
are accessible states of $M$.

Before we present the major results of this section we define a special machine $M_L$ that
will be seen to be a minimal machine recognizing the language $L$.

DEFINITION 4.7.5 Given the language $L$ over the alphabet $\Sigma$ for which $R_L$ has finitely many
equivalence classes, the DFSM $M_L = (\Sigma, Q_L, \delta_L, s_L, F_L)$ is defined in terms of the right-invariant
equivalence relation $R_L$ as follows:
a) the states $Q_L$ are the equivalence classes of $R_L$; b) the initial state $s_L$ is the equivalence class
$[\epsilon]$; c) the final states $F_L$ are the equivalence classes containing strings in the language $L$; d) for an
arbitrary equivalence class $[u]$ with representative element $u \in \Sigma^*$ and an arbitrary input letter
$a \in \Sigma$, the next-state transition function $\delta_L : Q_L \times \Sigma \mapsto Q_L$ is defined by $\delta_L([u], a) = [ua]$.

For this definition to make sense we must show that condition c) does not contradict the
facts about $R_L$: that an equivalence class containing a string in $L$ does not also contain a
string that is not in $L$. But by the definition of $R_L$, if we choose $z = \epsilon$, we have that $u R_L v$
only if $u$ and $v$ are both in $L$ or both not in $L$. We must also show that the next-state function
definition is consistent: it should not matter which representative of the equivalence class $[u]$ is used. In
particular, if we denote the class $[u]$ by $[v]$ for $v$ another member of the class, it should follow
that $[ua] = [va]$. But this is a consequence of the definition of $R_L$.

Figure 4.20 shows the machine $M_L$ associated with $L = (10^*1 + 0)^*$. The initial state
is associated with $[\epsilon]$, which is in the language. Thus, the initial state is also a final state. The
state associated with $[0]$ is also $[\epsilon]$ because $\epsilon$ and 0 are equivalent under $R_L$ (both contain an
even number of 1s). Thus, the transition from state
$[\epsilon]$ on input 0 is back to state $[\epsilon]$. Problem 4.31 asks the reader to complete the description of
this machine.

We need the notion of a refinement of an equivalence relation before we establish condi- 
tions for a language to be regular. 

DEFINITION 4.7.6 An equivalence relation $R$ over a set $A$ is a refinement of an equivalence
relation $S$ over the same set if $aRb$ implies that $aSb$. A refinement $R$ of $S$ is strict if there exist
$a, b \in A$ such that $aSb$ but it is not true that $aRb$.

Over the set A = {a, b, c, d}, the relation R = {{a} , {b} , {c, d}} is a strict refinement 
of the relation S = {{a, b}, {c, d}}. Clearly, if R is a refinement of S, R has no fewer 
equivalence classes than does S. If the refinement R of S is strict, R has more equivalence 
classes than does S. 
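When relations are represented by their equivalence classes, refinement can be checked class by class. The Python sketch below assumes that both arguments are partitions of the same set, given as collections of sets; the function names are hypothetical.

```python
def is_refinement(R, S):
    """True if every class of R lies inside some class of S."""
    return all(any(set(r) <= set(s) for s in S) for r in R)


def is_strict_refinement(R, S):
    """A refinement is strict when the two partitions differ."""
    same = {frozenset(r) for r in R} == {frozenset(s) for s in S}
    return is_refinement(R, S) and not same


R = [{"a"}, {"b"}, {"c", "d"}]
S = [{"a", "b"}, {"c", "d"}]
```

With these values, `is_strict_refinement(R, S)` holds, and `R` has three classes where `S` has two.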




Figure 4.20 The machine $M_L$ associated with $L = (10^*1 + 0)^*$.

4.7.2 The Myhill-Nerode Theorem

The following theorem uses the notion of refinement to give conditions under which a lan-
guage is regular.

THEOREM 4.7.1 (Myhill-Nerode) $L$ is a regular language if and only if $R_L$ has a finite num-
ber of equivalence classes. Furthermore, if $L$ is regular, it is the union of some of the equivalence
classes of $R_L$.

Proof We begin by showing that if $L$ is regular, $R_L$ has a finite number of equivalence
classes. Let $L$ be recognized by the DFSM $M = (\Sigma, Q, \delta, s, F)$. Then the number of
equivalence classes of $R_M$ is finite. Consider two strings $u, v \in \Sigma^*$ that are equivalent
under $R_M$. By definition, $u$ and $v$ carry $M$ from its initial state to the same state, whether
final or not. Thus, $uz$ and $vz$ also carry $M$ to the same state for every $z \in \Sigma^*$. It follows that $R_M$ is right-
invariant. Because $uz R_M vz$, either $uz$ and $vz$ take $M$ to a final state and are in $L$ or they take
$M$ to a non-final state and are not in $L$. It follows from the definition of $R_L$ that $u R_L v$.
Thus, $R_M$ is a refinement of $R_L$. Consequently, $R_L$ has no more equivalence classes than
does $R_M$ and this number is finite.

Now let $R_L$ have a finite number of equivalence classes. We show that the machine
$M_L$ recognizes $L$. Since it has a finite number of states, we are done. The proof that $M_L$
recognizes $L$ is straightforward. If $[w]$ is a final state, it is reached by applying to $M_L$ in
its initial state a string in $[w]$. Since the final states are the equivalence classes containing
exactly those strings that are in $L$, $M_L$ recognizes $L$. It follows that if $L$ is regular, it is the
union of some of the equivalence classes of $R_L$. ■

We now state an important corollary of this theorem that identifies a minimal machine 
recognizing a regular language L. Two DFSMs are isomorphic if they differ only in the names 
given to states. 

COROLLARY 4.7.1 If $L$ is regular, the machine $M_L$ is a minimal DFSM recognizing $L$. All other
such minimal machines are isomorphic to $M_L$.

Proof From the proof of Theorem 4.7.1, if $M$ is any DFSM recognizing $L$, it has no fewer
states than there are equivalence classes of $R_L$, which is the number of states of $M_L$. Thus,
$M_L$ has a minimal number of states.

Consider another minimal machine $M_0 = (\Sigma, Q_0, \delta_0, s_0, F_0)$. Each state of $M_0$ can
be identified with some state of $M_L$. Equate the initial states of $M_L$ and $M_0$ and let $q$ be
an arbitrary state of $M_0$. There is some string $u \in \Sigma^*$ such that $q = \delta_0(s_0, u)$. (If not,
$M_0$ is not minimal.) Equate state $q$ with state $\delta_L(s_L, u) = [u]$ of $M_L$. Let $v \in [u]$.
If $\delta_0(s_0, v) \neq q$, $M_0$ has more states than does $M_L$, which is a contradiction. Thus, the
identification of states in these two machines is consistent. The final states $F_0$ of $M_0$ are
identified with those equivalence classes of $M_L$ that contain strings in $L$.

Consider now the next-state function $\delta_0$ of $M_0$. Let state $q$ of $M_0$ be identified with
state $[u]$ of $M_L$ and let $a$ be an input letter. Then, if $\delta_0(q, a) = p$, it follows that $p$ is
associated with state $[ua]$ of $M_L$ because the input string $ua$ maps $s_0$ to state $p$ in $M_0$ and
maps $s_L$ to $[ua]$ in $M_L$. Thus, the next-state functions of the two machines are identical
up to a renaming of the states of the two machines. ■

4.7.3 A State Minimization Algorithm

The above approach does not offer a direct way to find a minimal-state machine. In this sec-
tion we give a procedure for this purpose. Given a regular language, we construct an NFSM
that recognizes it (Theorem 4.4.1) and then convert the NFSM to an equivalent DFSM (The-
orem 4.2.1). Once we have such a DFSM $M$, we give a procedure to minimize the number of
states based on combining equivalence classes of the right-invariant equivalence relation $R_M$
that are indistinguishable. (These equivalence classes are sets of states of $M$.) The resulting
machine is isomorphic to $M_L$, the minimal-state machine.

DEFINITION 4.7.7 Let $M = (\Sigma, Q, \delta, s, F)$ be a DFSM. The equivalence relation $\equiv_n$ on states
in $Q$ is defined as follows: two states $p$ and $q$ of $M$ are n-indistinguishable (denoted $p \equiv_n q$) if
and only if for all input strings $u \in \Sigma^*$ of length $|u| \leq n$ either both $\delta(p, u)$ and $\delta(q, u)$ are in
$F$ or both are not in $F$. (We write $p \not\equiv_n q$ if $p$ and $q$ are not n-indistinguishable.) Two states $p$
and $q$ are equivalent (denoted $p \equiv q$) if they are n-indistinguishable for all $n \geq 0$.

For arbitrary states $q_1$, $q_2$, and $q_3$, if $q_1$ and $q_2$ are n-indistinguishable and $q_2$ and $q_3$ are
n-indistinguishable, then $q_1$ and $q_3$ are n-indistinguishable. Thus, all three states are in the
same set of the partition and $\equiv_n$ is an equivalence relation. By an extension of this type of
reasoning to all values of $n$, it is also clear that $\equiv$ is an equivalence relation.

The following lemma establishes that $\equiv_{j+1}$ refines $\equiv_j$ and that for some $k$ and all $j \geq k$,
$\equiv_j$ is identical to $\equiv_k$, which is in turn equal to $\equiv$.

LEMMA 4.7.1 Let $M = (\Sigma, Q, \delta, s, F)$ be an arbitrary DFSM. Over the set $Q$ the equivalence
relation $\equiv_{n+1}$ is a refinement of the relation $\equiv_n$. Furthermore, if for some $k \leq |Q| - 2$, $\equiv_{k+1}$
and $\equiv_k$ are equal, then so are $\equiv_{j+1}$ and $\equiv_j$ for all $j \geq k$. In particular, $\equiv_k$ and $\equiv$ are identical.

Proof If $p \equiv_{n+1} q$ then $p \equiv_n q$ by definition. Thus, for $n \geq 0$, $\equiv_{n+1}$ refines $\equiv_n$.

We now show that if $\equiv_{k+1}$ and $\equiv_k$ are equal, then $\equiv_{j+1}$ and $\equiv_j$ are equal for all $j \geq k$.
Suppose not. Let $l$ be the smallest value of $j$ for which $\equiv_{j+1}$ and $\equiv_j$ are equal but $\equiv_{j+2}$ and
$\equiv_{j+1}$ are not equal. It follows that there exist two states $p$ and $q$ that are indistinguishable
for input strings of length $l + 1$ or less but are distinguishable for some input string $v$ of
length $|v| = l + 2$. Let $v = au$ where $a \in \Sigma$ and $|u| = l + 1$. Since $\delta(p, v) = \delta(\delta(p, a), u)$
and $\delta(q, v) = \delta(\delta(q, a), u)$, it follows that the states $\delta(p, a)$ and $\delta(q, a)$ are distinguishable
by some string $u$ of length $l + 1$ but not by any string of length $l$ or less. But this contradicts the
assumption that $\equiv_{l+1}$ and $\equiv_l$ are equal.

The relation $\equiv_0$ has two equivalence classes, the final states and all other states. For each
integer $j \leq k$, where $k$ is the smallest integer such that $\equiv_{k+1}$ and $\equiv_k$ are equal, $\equiv_j$ has at
least one more equivalence class than does $\equiv_{j-1}$. That is, it has at least $j + 2$ classes. Since
$\equiv_k$ can have at most $|Q|$ equivalence classes, it follows that $k + 2 \leq |Q|$.

Clearly, $\equiv_k$ and $\equiv$ are identical because if two states cannot be distinguished by input
strings of length $k$ or less, they cannot be distinguished by input strings of any length. ■

The proof of this lemma provides an algorithm to compute the equivalence relation $\equiv$,
namely, compute the relations $\equiv_j$, $0 \leq j \leq |Q| - 2$ in succession until we find two relations
that are identical. We find $\equiv_{j+1}$ from $\equiv_j$ as follows: for every pair of states $(p, q)$ in an
equivalence class of $\equiv_j$, we find their successor states $\delta(p, a)$ and $\delta(q, a)$ under input letter
$a$ for each such letter. If for all letters $a$, $\delta(p, a) \equiv_j \delta(q, a)$ and $p \equiv_j q$, then $p \equiv_{j+1} q$
because we cannot distinguish between $p$ and $q$ on inputs of length $j + 1$ or less. Thus, the
algorithm compares each pair of states in an equivalence class of $\equiv_j$ and forms equivalence
classes of $\equiv_{j+1}$ by grouping together states whose successors under input letters are in the
same equivalence class of $\equiv_j$.

To illustrate these ideas, consider the DFSM of Fig. 4.14. The equivalence classes of $\equiv_0$ are
$\{\{s_0, q_R\}, \{q_1, q_2, q_3\}\}$. Since $\delta(s_0, 0)$ and $\delta(q_R, 0)$ are in different equivalence classes of $\equiv_0$,
$s_0$ and $q_R$ are in different
equivalence classes of $\equiv_1$. Also, because $\delta(q_3, 0) = q_R$ and $\delta(q_1, 0) = \delta(q_2, 0) = q_1 \in F$, $q_3$ is
in a different equivalence class of $\equiv_1$ from $q_1$ and $q_2$. The latter two states are in the same equiv-
alence class because $\delta(q_1, 1) = \delta(q_2, 1) = q_R \notin F$. Thus, $\equiv_1 = \{\{s_0\}, \{q_R\}, \{q_3\}, \{q_1, q_2\}\}$.
The only one of these equivalence classes that could be refined is the last one. However, since
we cannot distinguish between the two states in this class under any input, no further refine-
ment is possible and $\equiv\ =\ \equiv_1$.
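The refinement procedure can be sketched in Python as follows. Each round labels every state with its current class together with the classes of its successors, which is exactly the condition for passing from one relation to the next, and the loop stops when a round creates no new class. The function name `equivalence_classes` is hypothetical. The transition table is an assumption: it is inferred from the transitions quoted in the example and from the language 10* + 0, since the figure itself is not reproduced here.

```python
def equivalence_classes(states, alphabet, delta, finals):
    """Partition the states of a DFSM into classes of equivalent states."""
    # block[q] names the class of q; start with final versus non-final states.
    block = {q: int(q in finals) for q in states}
    while True:
        # p and q stay together only if they are together now and their
        # successors are together for every input letter.
        signature = {
            q: (block[q],) + tuple(block[delta[(q, a)]] for a in alphabet)
            for q in states
        }
        ids = {sig: n for n, sig in enumerate(sorted(set(signature.values())))}
        if len(ids) == len(set(block.values())):
            break  # no class was split, so the relation is stable
        block = {q: ids[signature[q]] for q in states}
    classes = {}
    for q in states:
        classes.setdefault(block[q], set()).add(q)
    return {frozenset(c) for c in classes.values()}


# Assumed transition table for the DFSM of Fig. 4.14 (accepting 10* + 0).
STATES = ["s0", "q1", "q2", "q3", "qR"]
DELTA = {("s0", "0"): "q3", ("s0", "1"): "q2",
         ("q1", "0"): "q1", ("q1", "1"): "qR",
         ("q2", "0"): "q1", ("q2", "1"): "qR",
         ("q3", "0"): "qR", ("q3", "1"): "qR",
         ("qR", "0"): "qR", ("qR", "1"): "qR"}
FINALS = {"q1", "q2", "q3"}
```

Under these assumptions the function returns the four classes {s0}, {qR}, {q3} and {q1, q2}, in agreement with the example.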

We now show that if two states are equivalent under $\equiv$, they can be combined, but if they
are distinguishable under $\equiv$, they cannot. Applying this procedure provides a minimal-state
DFSM.

DEFINITION 4.7.8 Let $M = (\Sigma, Q, \delta, s, F)$ be a DFSM and let $\equiv$ be the equivalence relation
defined above over $Q$. The DFSM $M_\equiv = (\Sigma, Q_\equiv, \delta_\equiv, [s], F_\equiv)$ associated with the relation $\equiv$
is defined as follows: a) the states $Q_\equiv$ are the equivalence classes of $\equiv$; b) the initial state of $M_\equiv$
is $[s]$; c) the final states $F_\equiv$ are the equivalence classes containing states in $F$; d) for an arbitrary
equivalence class $[q]$ with representative element $q \in Q$ and an arbitrary input letter $a \in \Sigma$, the
next-state function $\delta_\equiv : Q_\equiv \times \Sigma \mapsto Q_\equiv$ is defined by $\delta_\equiv([q], a) = [\delta(q, a)]$.

This definition is consistent; no matter which representative of the equivalence class $[q]$ is
used, the next state on input $a$ is $[\delta(q, a)]$. It is straightforward to show that $M_\equiv$ recognizes
the same language as does $M$. (See Problem 4.27.) We now show that $M_\equiv$ is a minimal-state
machine.

THEOREM 4.7.2 $M_\equiv$ is a minimal-state machine.

Proof Let $M = (\Sigma, Q, \delta, s, F)$ be a DFSM recognizing $L$ and let $M_\equiv$ be the DFSM
associated with the equivalence relation $\equiv$ on $Q$. Without loss of generality, we assume
that all states of $M_\equiv$ are accessible from the initial state. We now show that $M_\equiv$ has no
more states than $M_L$. Suppose it has more states. That is, suppose $M_\equiv$ has more states
than there are equivalence classes of $R_L$. Then, there must be two states $p$ and $q$ of $M$
such that $[p] \ne [q]$ but that $u R_L v$, where $u$ and $v$ carry $M$ from its initial state to $p$ and
$q$, respectively. (If this were not the case, any strings equivalent under $R_L$ would carry $M$
from its initial state $s$ to equivalent states, contradicting the assumption that $M_\equiv$ has more
states than $M_L$.) But if $u R_L v$, then since $R_L$ is right-invariant, $uw R_L vw$ for all $w \in \Sigma^*$.
However, because $[p] \ne [q]$, there is some $z \in \Sigma^*$ such that $[p]$ and $[q]$ can be distinguished.
This is equivalent to saying that $uz R_L vz$ does not hold, a contradiction. Thus, $M_\equiv$ and
$M_L$ have the same number of states. Since $M_\equiv$ recognizes $L$, it is a minimal-state machine
equivalent to $M$. ■

As shown above, the equivalence relation $\equiv$ for the DFSM of Fig. 4.14 is $\{\{q_0\}, \{q_R\}, \{q_3\},
\{q_1, q_2\}\}$. The DFSM associated with this relation, $M_\equiv$, is shown in Fig. 4.21.
It clearly recognizes the language $10^* + 0$. It follows that the equivalent DFSM of Fig. 4.14 is
not minimal.

Figure 4.21 A minimal-state DFSM equivalent to the DFSM in Fig. 4.14.



4.8 Pushdown Automata 

The pushdown automaton (PDA) has a one-way, read-only, potentially infinite input tape on 
which an input string is written (see Fig. 4.22); its head either advances to the right from the 
leftmost cell or remains stationary. It also has a stack, a storage medium analogous to the stack 
of trays in a cafeteria. The stack is a potentially infinite ordered collection of initially blank 
cells with the property that data can be pushed onto it or popped from it. Data is pushed onto 
the top of the stack by moving all existing entries down one cell and inserting the new element 
in the top location. Data is popped by removing the top element and moving all other entries 
up one cell. The control unit of a pushdown automaton is a finite-state machine. The full 
power of the PDA is realized only when its control unit is nondeterministic. 

DEFINITION 4.8.1 A pushdown automaton (PDA) is a six-tuple $M = (\Sigma, \Gamma, Q, \Delta, s, F)$,
where $\Sigma$ is the tape alphabet containing the blank symbol $\beta$, $\Gamma$ is the stack alphabet containing
the blank symbol $\gamma$, $Q$ is the finite set of states, $\Delta \subseteq (Q \times (\Sigma \cup \{\epsilon\}) \times (\Gamma \cup \{\epsilon\}) \times Q \times (\Gamma \cup \{\epsilon\}))$
is the set of transitions, $s$ is the initial state, and $F$ is the set of final states. We now describe
transitions.

If for state $p$, tape symbol $x$, and stack symbol $y$ the transition $(p, x, y; q, z) \in \Delta$, then if $M$
is in state $p$, $x \in \Sigma$ is under its tape head, and $y \in \Gamma$ is at the top of its stack, $M$ may pop $y$ from
its stack, enter state $q \in Q$, and push $z \in \Gamma$ onto its stack. However, if $x = \epsilon$, $y = \epsilon$ or $z = \epsilon$,
then $M$ does not read its tape, pop its stack or push onto its stack, respectively. The head on the tape
either remains stationary if $x = \epsilon$ or advances one cell to the right if $x \ne \epsilon$.

If at each point in time a unique transition (p, x, y; q, z) may be applied, the PDA is deter- 
ministic. Otherwise it is nondeterministic. 

The PDA $M$ accepts the input string $w \in \Sigma^*$ if when started in state $s$ with an empty
stack (its cells contain the blank stack symbol $\gamma$) and $w$ placed left-adjusted on its otherwise blank
tape (its blank cells contain the blank tape symbol $\beta$), the last state entered by $M$ after reading
the components of $w$ and no other tape cells is a member of the set $F$. $M$ accepts the language
$L(M)$ consisting of all such strings.

Some of the special cases for the action of the PDA $M$ on empty tape or stack symbols
are the following: if $(p, x, \epsilon; q, z)$, $x$ is read, state $q$ is entered, and $z$ is pushed onto
the stack; if $(p, x, y; q, \epsilon)$, $x$ is read, state $q$ is entered, and $y$ is popped from the stack;
if $(p, \epsilon, y; q, z)$, no input is read, $y$ is popped, $z$ is pushed and state $q$ is entered. Also, if
$(p, \epsilon, \epsilon; q, \epsilon)$, $M$ moves from state $p$ to $q$ without reading input, or pushing or popping the
stack.

Figure 4.22 The control unit, one-way input tape, and stack of a pushdown automaton.

Observe that if every transition is of the form $(p, x, \epsilon; q, \epsilon)$, the PDA ignores the stack and
simulates an FSM. Thus, the languages accepted by PDAs include the regular languages.

We emphasize that a PDA is nondeterministic if for some state $q$, tape symbol $x$, and top
stack item $y$ there is more than one transition that $M$ can make. For example, if $\Delta$ contains
$(s, a, \epsilon; s, a)$ and $(s, a, a; r, \epsilon)$, $M$ has the choice of ignoring or popping the top of the stack
and of moving to state $s$ or $r$. If after reading all symbols of $w$ $M$ enters a state in $F$, then $M$
accepts $w$.

We now give two examples of PDAs and the languages they accept. The first accepts
palindromes of the form $\{w c w^R\}$, where $w^R$ is the reverse of $w$ and $w \in \{a, b\}^*$. The state
diagram of its control unit is shown in Fig. 4.23. The second PDA accepts those strings over
$\{a, b\}$ of the form $a^n b^m$ for which $n \ge m$.

EXAMPLE 4.8.1 The PDA $M = (\Sigma, \Gamma, Q, \Delta, s, F)$, where $\Sigma = \{a, b, c, \beta\}$, $\Gamma = \{a, b, \gamma\}$,
$Q = \{s, p, r, f\}$, $F = \{f\}$ and $\Delta$ contains the transitions shown in Fig. 4.24, accepts the
language $L = \{w c w^R\}$.

The PDA $M$ of Figs. 4.23 and 4.24 remains in the stacking state $s$ while encountering
a's and b's on the input tape, pushing these letters onto the stack (Rules (a) and (b)); the order
of these letters on the stack is the reverse of their order on the input tape. If it encounters an
instance of letter $c$ while in state $s$, it enters the possible accept state $p$ (Rule (c)) but enters
the reject state $r$ if it encounters a blank on the input tape (Rule (d)). While in state $p$ it
pops an $a$ or $b$ that matches the same letter on the input tape (Rules (e) and (f)). If the PDA
discovers blank tape and stack symbols, it has identified a palindrome and enters the accept
state $f$ (Rule (g)). On the other hand, if while in state $p$ the tape symbol and the symbol on
the top of the stack are different or the letter $c$ is encountered, the PDA enters the reject state
$r$ (Rules (h)-(n)). Finally, the PDA does not exit from either the reject or accept states (Rules
(o) and (p)).

Figure 4.23 State diagram for the pushdown automaton of Fig. 4.24 which accepts $\{w c w^R\}$.
An edge label a, b; c between states p and q corresponds to the transition (p, a, b; q, c).





Rule                                      Comment
(a)  $(s, a, \epsilon; s, a)$             push a
(b)  $(s, b, \epsilon; s, b)$             push b
(c)  $(s, c, \epsilon; p, \epsilon)$      accept?
(d)  $(s, \beta, \epsilon; r, \epsilon)$  reject
(e)  $(p, a, a; p, \epsilon)$             accept?
(f)  $(p, b, b; p, \epsilon)$             accept?
(g)  $(p, \beta, \gamma; f, \epsilon)$    accept
(h)  $(p, a, b; r, \epsilon)$             reject
(i)  $(p, b, a; r, \epsilon)$             reject
(j)  $(p, \beta, a; r, \epsilon)$         reject
(k)  $(p, \beta, b; r, \epsilon)$         reject
(l)  $(p, a, \gamma; r, \epsilon)$        reject
(m)  $(p, b, \gamma; r, \epsilon)$        reject
(n)  $(p, c, \epsilon; r, \epsilon)$      reject
(o)  $(r, \epsilon, \epsilon; r, \epsilon)$  stay in reject state
(p)  $(f, \epsilon, \epsilon; f, \epsilon)$  stay in accept state

Figure 4.24 Transitions for the PDA described by the state diagram of Fig. 4.23.

Rule                                      Comment
(a)  $(s, \beta, \epsilon; f, \epsilon)$  accept
(b)  $(s, a, \epsilon; s, a)$             push a
(c)  $(s, b, \gamma; r, \epsilon)$        reject
(d)  $(s, b, a; p, \epsilon)$             pop a, enter pop state
(e)  $(p, b, a; p, \epsilon)$             pop a
(f)  $(p, b, \gamma; r, \epsilon)$        reject
(g)  $(p, \beta, a; f, \epsilon)$         accept
(h)  $(p, \beta, \gamma; f, \epsilon)$    accept
(i)  $(p, a, \epsilon; r, \epsilon)$      reject
(j)  $(f, \epsilon, \epsilon; f, \epsilon)$  stay in accept state
(k)  $(r, \epsilon, \epsilon; r, \epsilon)$  stay in reject state

Figure 4.25 Transitions for a PDA that accepts the language $\{a^n b^m \mid n \ge m \ge 0\}$.



EXAMPLE 4.8.2 The PDA $M = (\Sigma, \Gamma, Q, \Delta, s, F)$, where $\Sigma = \{a, b, \beta\}$, $\Gamma = \{a, b, \gamma\}$,
$Q = \{s, p, r, f\}$, $F = \{f\}$ and $\Delta$ contains the transitions shown in Fig. 4.25, accepts the
language $L = \{a^n b^m \mid n \ge m \ge 0\}$. The state diagram for this machine is shown in Fig. 4.26.

The rules of Fig. 4.25 work as follows. An empty input in the stacking state $s$ is accepted
(Rule (a)). If a string of a's is found, the PDA remains in state $s$ and the a's are pushed onto
the stack (Rule (b)). At the first discovery of a $b$ in the input while in state $s$, if the stack is
empty, the input is rejected by entering the reject state (Rule (c)). If the stack is not empty,
the $a$ at the top is popped and the PDA enters the pop state $p$ (Rule (d)). If while in $p$ a $b$
is discovered on the input tape when an $a$ is found at the top of the stack (Rule (e)), the PDA
pops the $a$ and stays in this state because it remains possible that the input contains no more b's
than a's. On the other hand, if the stack is empty when a $b$ is discovered, the PDA enters the
reject state (Rule (f)). If in state $p$ the PDA discovers that it has more a's than b's by reading
the blank tape letter $\beta$ when the stack is not empty, it enters the accept state $f$ (Rule (g)). It
also accepts if the blank is read when the stack is empty, since the input then contains equally
many a's and b's (Rule (h)). If
the PDA encounters an $a$ on its input tape when in state $p$, an $a$ has been received after a $b$
and the input is rejected (Rule (i)). After the PDA enters either the accept or reject states, it
remains there (Rules (j) and (k)).

Figure 4.26 The state diagram for the PDA defined by the tables in Fig. 4.25.
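
The behaviour of both example machines can be checked with a small simulator. The following Python sketch is only illustrative: the function name `accepts` and the encoding of transitions as 5-tuples are our own, and it assumes, as the two rule tables do, that the end of the input is detected by reading a single trailing blank $\beta$ and that an empty stack presents the blank stack symbol $\gamma$ at its top. It explores all configurations breadth-first, so it also handles nondeterministic rule sets, provided no cycle of $\epsilon$-moves pushes onto the stack forever.

```python
from collections import deque

EPS, B, G = "", "β", "γ"   # empty symbol, blank tape symbol, blank stack symbol


def accepts(transitions, start, finals, word):
    """Return True if some computation reads word plus one blank and ends in finals."""
    tape = list(word) + [B]
    frontier = deque([(start, 0, ())])          # (state, head position, stack)
    seen = set(frontier)
    while frontier:
        state, pos, stack = frontier.popleft()
        if pos == len(tape) and state in finals:
            return True
        top = stack[-1] if stack else G          # empty stack shows the blank
        for p, x, y, q, z in transitions:
            if p != state:
                continue
            if x != EPS and (pos >= len(tape) or tape[pos] != x):
                continue
            if y != EPS and y != top:
                continue
            new_stack = stack
            if y != EPS and new_stack:           # popping a blank leaves the stack empty
                new_stack = new_stack[:-1]
            if z != EPS:
                new_stack = new_stack + (z,)
            nxt = (q, pos + (x != EPS), new_stack)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return False


# Rules of Fig. 4.24 (language w c w^R).
PALINDROME = [
    ("s", "a", EPS, "s", "a"), ("s", "b", EPS, "s", "b"),
    ("s", "c", EPS, "p", EPS), ("s", B, EPS, "r", EPS),
    ("p", "a", "a", "p", EPS), ("p", "b", "b", "p", EPS),
    ("p", B, G, "f", EPS),
    ("p", "a", "b", "r", EPS), ("p", "b", "a", "r", EPS),
    ("p", B, "a", "r", EPS), ("p", B, "b", "r", EPS),
    ("p", "a", G, "r", EPS), ("p", "b", G, "r", EPS),
    ("p", "c", EPS, "r", EPS),
    ("r", EPS, EPS, "r", EPS), ("f", EPS, EPS, "f", EPS),
]

# Rules of Fig. 4.25 (language a^n b^m with n >= m >= 0).
MORE_AS = [
    ("s", B, EPS, "f", EPS), ("s", "a", EPS, "s", "a"),
    ("s", "b", G, "r", EPS), ("s", "b", "a", "p", EPS),
    ("p", "b", "a", "p", EPS), ("p", "b", G, "r", EPS),
    ("p", B, "a", "f", EPS), ("p", B, G, "f", EPS),
    ("p", "a", EPS, "r", EPS),
    ("f", EPS, EPS, "f", EPS), ("r", EPS, EPS, "r", EPS),
]

print(accepts(PALINDROME, "s", {"f"}, "abcba"))   # True
print(accepts(MORE_AS, "s", {"f"}, "abb"))        # False
```

Keeping the stack as an immutable tuple makes each configuration hashable, which is what allows the `seen` set to cut off the self-loops in the accept and reject states.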

In Section 4.12 we show that the languages recognized by pushdown automata are exactly
the context-free languages described in the next section.



4.9 Formal Languages 

Languages are introduced in Section 1.2.3. A language is a set of strings over a finite set $\Sigma$,
with $|\Sigma| \ge 2$, called an alphabet. $\Sigma^*$ is the language of all strings over $\Sigma$ including the empty
string $\epsilon$, which has zero length. The empty string has the property that for an arbitrary string
$w$, $\epsilon w = w = w \epsilon$. $\Sigma^+$ is the set $\Sigma^*$ without the empty string.

In this section we introduce grammars for languages, rules for rewriting strings through
the substitution of substrings. A grammar consists of alphabets $\mathcal{T}$ and $\mathcal{N}$ of terminal and
non-terminal symbols, respectively, a designated non-terminal start symbol, plus a set of rules
$\mathcal{R}$ for rewriting strings. Below we define four types of language in terms of their grammars:
the phrase-structure, context-sensitive, context-free, and regular grammars.

The role of grammars is best illustrated with an example for a small fragment of English.
Consider a grammar $G$ whose non-terminals $\mathcal{N}$ contain a start symbol S denoting a generic
sentence and NP and VP denoting generic noun and verb phrases, respectively. In turn, assume
that $\mathcal{N}$ also contains non-terminals for adjectives and adverbs, namely AJ and AV. Thus, $\mathcal{N}$ =
{S, NP, VP, AJ, AV, N, V}. We allow the grammar to have the following words as terminals:
$\mathcal{T}$ = {bob, alice, duck, big, smiles, quacks, loudly}. Here bob, alice, and duck are nouns,
big is an adjective, smiles and quacks are verbs, and loudly is an adverb. In our fragment of
English a sentence consists of a noun phrase followed by a verb phrase, which we denote by the
rule S → NP VP. This and the other rules $\mathcal{R}$ of the grammar are shown below. They include
rules to map non-terminals to terminals, such as N → bob.

S  → NP VP      N  → bob       V  → smiles
NP → N          N  → alice     V  → quacks
NP → AJ N       N  → duck      AV → loudly
VP → V          AJ → big
VP → V AV

With these rules the following strings (sentences) can be generated: bob smiles; big duck 
quacks loudly; and alice quacks. The first two sentences are acceptable English sentences, 
but the third is not if we interpret alice as a person. This example illustrates the need for rules 
that limit the rewriting of non-terminals to an appropriate context of surrounding symbols. 
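
Because this fragment has no recursive rules, every sentence it generates can be listed mechanically. The Python sketch below is only an illustration (the dictionary `RULES` and the function `expand` are our own names); it expands each non-terminal into all word sequences it can produce.

```python
from itertools import product

# Rules of the English fragment; any symbol without an entry is a terminal.
RULES = {
    "S": [["NP", "VP"]],
    "NP": [["N"], ["AJ", "N"]],
    "VP": [["V"], ["V", "AV"]],
    "N": [["bob"], ["alice"], ["duck"]],
    "V": [["smiles"], ["quacks"]],
    "AJ": [["big"]],
    "AV": [["loudly"]],
}


def expand(symbol):
    """Return every list of terminals derivable from symbol."""
    if symbol not in RULES:
        return [[symbol]]
    results = []
    for rhs in RULES[symbol]:
        # Combine every choice for each symbol on the right-hand side.
        for parts in product(*(expand(s) for s in rhs)):
            results.append([word for part in parts for word in part])
    return results


sentences = {" ".join(words) for words in expand("S")}
print(len(sentences))                 # 24
print("alice quacks" in sentences)    # True
```

The enumeration terminates only because no non-terminal can reach itself; a grammar with recursive rules would need a bound on the derivation depth.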

Grammars for formal languages generalize these ideas. Grammars are used to interpret
programming languages. A language is translated and given meaning through a series of steps
the first of which is lexical analysis. In lexical analysis symbols such as a, l, i, c, e are grouped
into tokens such as alice, or some other string denoting alice. This task is typically done with
a finite-state machine. The second step in translation is parsing, a process in which a tokenized
string is associated with a series of derivations or applications of the rules of a grammar. For
example, big duck quacks loudly can be produced by the following sequence of derivations:
S → NP VP; NP → AJ N; AJ → big; N → duck; VP → V AV; V → quacks; AV → loudly.
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In his exploration of models for natural language, Noam Chomsky introduced four lan- 
guage types of decreasing expressibility, now called the Chomsky hierarchy, in which each 
language is described by the type of grammar generating it. These languages serve as a basis for 
the classification of programming languages. The four types are the phrase-structure languages, 
the context-sensitive languages, the context-free languages, and the regular languages. 

There is an exact correspondence between each of these types of languages and particular 
machine architectures in the sense that for each language type T there is a machine architecture 
A recognizing languages of type T and for each architecture A there is a type T such that all 
languages recognized by A are of type T. The correspondence between language and architec- 
ture is shown in the following table, which also lists the section or problem where the result is 
established. Here the linear bounded automaton is a Turing machine in which the number 
of tape cells that are used is linear in the length of the input string. 


Level   Language Type       Machine Type                   Proof Location
0       phrase-structure    Turing machine                 Section 5.4
1       context-sensitive   linear bounded automaton       Problem 4.36
2       context-free        nondet. pushdown automaton     Section 4.12
3       regular             finite-state machine           Section 4.10



We now give formal definitions of each of the grammar types under consideration. 

4.9.1 Phrase-Structure Languages 

In Section 5.4 we show that the languages generated by the phrase-structure grammars defined
below are exactly the languages that can be recognized by Turing machines.

DEFINITION 4.9.1 A phrase-structure grammar $G$ is a four-tuple $G = (\mathcal{N}, \mathcal{T}, \mathcal{R}, S)$ where
$\mathcal{N}$ and $\mathcal{T}$ are disjoint alphabets of non-terminals and terminals, respectively. Let $V = \mathcal{N} \cup \mathcal{T}$.
The rules $\mathcal{R}$ form a finite subset of $V^+ \times V^*$ (denoted $\mathcal{R} \subseteq V^+ \times V^*$) where for every rule
$(a, b) \in \mathcal{R}$, $a$ contains at least one non-terminal symbol. The symbol $S \in \mathcal{N}$ is the start symbol.

If $(a, b) \in \mathcal{R}$ we write $a \to b$. If $u \in V^+$ and $a$ is a contiguous substring of $u$, then $u$ can
be replaced by the string $v$ by substituting $b$ for $a$. If this holds, we write $u \Rightarrow_G v$ and call it an
immediate derivation. Extending this notation, if through a sequence of immediate derivations
(called a derivation) $u \Rightarrow_G x_1$, $x_1 \Rightarrow_G x_2$, $\ldots$, $x_n \Rightarrow_G v$ we can transform $u$ to $v$, we
write $u \Rightarrow^*_G v$ and say that $v$ derives from $u$. If the rules $\mathcal{R}$ contain $(a, a)$ for all $a \in \mathcal{N}^+$, the
relation $\Rightarrow^*_G$ is called the transitive closure of the relation $\Rightarrow_G$ and $u \Rightarrow^*_G u$ for all $u \in V^*$
containing at least one non-terminal symbol.

The language $L(G)$ defined by the grammar $G$ is the set of all terminal strings that can be
derived from the start symbol $S$; that is,

$L(G) = \{u \in \mathcal{T}^* \mid S \Rightarrow^*_G u\}$



When the context is clear we drop the subscript $G$ in $\Rightarrow_G$ and $\Rightarrow^*_G$. These definitions are
best understood from an example. In all our examples we use letters in SMALL CAPS to denote
non-terminals and letters in italics to denote terminals, except that $\epsilon$, the empty letter, may
also be a terminal.

EXAMPLE 4.9.1 Consider the grammar $G_1 = (\mathcal{N}_1, \mathcal{T}_1, \mathcal{R}_1, S)$, where $\mathcal{N}_1 = \{S, B, C\}$, $\mathcal{T}_1 =
\{a, b, c\}$ and $\mathcal{R}_1$ consists of the following rules:

a) $S \to aSBC$      d) $aB \to ab$      g) $cC \to cc$
b) $S \to aBC$       e) $bB \to bb$
c) $CB \to BC$       f) $bC \to bc$

Clearly the string $aaBCBC$ can be rewritten as $aaBBCC$ using rule (c), that is, $aaBCBC \Rightarrow
aaBBCC$. One application of (d), one of (e), one of (f), and one of (g) reduces it to the string
$aabbcc$. Since one application of (a) and one of (b) produces the string $aaBCBC$, it follows
that the language $L(G_1)$ contains $aabbcc$.

Similarly, two applications of (a) and one of (b) produce $aaaBCBCBC$, after which three
applications of (c) produce the string $aaaBBBCCC$. One application of (d) and two of (e)
produce $aaabbbCCC$, after which one application of (f) and two of (g) produces $aaabbbccc$.
In general, one can show that $L(G_1) = \{a^n b^n c^n \mid n \ge 1\}$. (See Problem 4.38.)
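
The derivations above can be reproduced by searching over sentential forms. The Python sketch below is illustrative only (the function `terminal_strings` is our own name); it writes non-terminals as uppercase letters and terminals as lowercase letters, and it relies on the fact that no rule of this grammar shortens a string, so forms longer than the chosen bound can safely be discarded.

```python
from collections import deque

# Rules (a)-(g) of Example 4.9.1 as (left-hand side, right-hand side) pairs.
RULES = [("S", "aSBC"), ("S", "aBC"), ("CB", "BC"),
         ("aB", "ab"), ("bB", "bb"), ("bC", "bc"), ("cC", "cc")]


def terminal_strings(max_len, start="S"):
    """Breadth-first search over all sentential forms of length <= max_len."""
    seen = {start}
    queue = deque([start])
    found = set()
    while queue:
        u = queue.popleft()
        if u.islower():                 # only terminals remain
            found.add(u)
            continue
        for a, b in RULES:
            i = u.find(a)
            while i != -1:              # try every occurrence of a in u
                v = u[:i] + b + u[i + len(a):]
                if len(v) <= max_len and v not in seen:
                    seen.add(v)
                    queue.append(v)
                i = u.find(a, i + 1)
    return sorted(found, key=len)


print(terminal_strings(9))   # ['abc', 'aabbcc', 'aaabbbccc']
```

Forms such as $aabcBC$, in which a $B$ follows a $c$, are dead ends because no rule has $cB$ on its left-hand side; the search simply abandons them.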

4.9.2 Context-Sensitive Languages 

The context-sensitive languages are exactly the languages accepted by linear bounded automata, 
nondeterministic Turing machines whose tape heads visit a number of cells that is a constant 
multiple of the length of an input string. (See Problem 4.36.) 

DEFINITION 4.9.2 A context-sensitive grammar $G$ is a phrase structure grammar $G = (\mathcal{N},
\mathcal{T}, \mathcal{R}, S)$ in which each rule $(a, b) \in \mathcal{R}$ satisfies the condition that $b$ has no fewer characters
than does $a$, namely, $|a| \le |b|$. The languages defined by context-sensitive grammars are called
context-sensitive languages (CSL).

Each rule of a context-sensitive grammar maps a string to one that is no shorter. Since the 
left-hand side of a rule may have more than one character, it may make replacements based 
on the context in which a non-terminal is found. Examples of context-sensitive languages are 
given in Problems 4.38 and 4.39. 

4.9.3 Context-Free Languages 

As shown in Section 4.12, the context-free languages are exactly the languages accepted by 
pushdown automata. 

DEFINITION 4.9.3 A context-free grammar $G = (\mathcal{N}, \mathcal{T}, \mathcal{R}, S)$ is a phrase structure grammar
in which each rule in $\mathcal{R} \subseteq \mathcal{N} \times V^*$ has a single non-terminal on the left-hand side. The languages
defined by context-free grammars are called context-free languages (CFL).

Each rule of a context-free grammar maps a non-terminal to a string over $V^*$ without
regard to the context in which the non-terminal is found because the left-hand side of each
rule consists of a single non-terminal.

EXAMPLE 4.9.2 Let $\mathcal{N}_2 = \{S, A\}$, $\mathcal{T}_2 = \{\epsilon, a, b\}$, and $\mathcal{R}_2 = \{S \to aSb, S \to \epsilon\}$. Then the
grammar $G_2 = (\mathcal{N}_2, \mathcal{T}_2, \mathcal{R}_2, S)$ is context-free and generates the language $L(G_2) = \{a^n b^n \mid n \ge
0\}$. To see this, let the rule $S \to aSb$ be applied $k$ times to produce the string $a^k S b^k$. A final
application of the last rule establishes the result.

EXAMPLE 4.9.3 Consider the grammar $G_3$ with the following rules and the implied terminal and
non-terminal alphabets:

a) $S \to cMcNc$      d) $N \to bNb$
b) $M \to aMa$        e) $N \to c$
c) $M \to c$

$G_3$ is context-free and generates the language $L(G_3) = \{c a^n c a^n c b^m c b^m c \mid n, m \ge 0\}$, as is
easily shown.



Context-free languages capture important aspects of many programming languages. As 
a consequence, the parsing of context-free languages is an important step in the parsing of 
programming languages. This topic is discussed in Section 4.11. 

4.9.4 Regular Languages 

DEFINITION 4.9.4 A regular grammar $G$ is a context-free grammar $G = (\mathcal{N}, \mathcal{T}, \mathcal{R}, S)$, where
the right-hand side is either a terminal or a terminal followed by a non-terminal. That is, its rules
are of the form $A \to a$ or $A \to bC$. The languages defined by regular grammars are called regular
languages.

Some authors define a regular grammar to be one whose rules are of the form $A \to a$
or $A \to b_1 b_2 \cdots b_k C$. It is straightforward to show that any language generated by such a
grammar can be generated by a grammar of the type defined above.

The following grammar is regular.

EXAMPLE 4.9.4 Consider the grammar $G_4 = (\mathcal{N}_4, \mathcal{T}_4, \mathcal{R}_4, S)$ where $\mathcal{N}_4 = \{S, A, B\}$, $\mathcal{T}_4 =
\{0, 1\}$ and $\mathcal{R}_4$ consists of the rules given below.

a) $S \to 0A$      d) $B \to 0A$
b) $S \to 0$       e) $B \to 0$
c) $A \to 1B$

It is straightforward to see that the rules a) $S \to 0$, b) $S \to 01B$, c) $B \to 0$, and d) $B \to 01B$
generate the same strings as the rules given above. Thus, the language $L(G_4)$ contains the strings
0, 010, 01010, 0101010, ..., that is, strings of the form $(01)^k 0$ for $k \ge 0$. Consequently
$L(G_4) = (01)^* 0$. A formal proof of this result is left to the reader. (See Problem AAA.)
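
The four definitions of this section differ only in the shape of the rules they allow, which a short classifier makes concrete. The Python sketch below is an illustration under stated assumptions: the function name `classify` is ours, each symbol is a single character, the empty string stands for $\epsilon$, and every left-hand side is assumed to contain a non-terminal as Definition 4.9.1 requires. It reports the most restrictive of the four types whose rule format all the rules satisfy.

```python
def classify(rules, nonterminals):
    """rules: list of (lhs, rhs) strings; returns the most restrictive type."""

    def is_regular(lhs, rhs):
        # A -> a  or  A -> bC
        return (len(lhs) == 1 and lhs in nonterminals
                and 1 <= len(rhs) <= 2
                and rhs[0] not in nonterminals
                and (len(rhs) == 1 or rhs[1] in nonterminals))

    def is_context_free(lhs, rhs):
        # a single non-terminal on the left-hand side
        return len(lhs) == 1 and lhs in nonterminals

    def is_context_sensitive(lhs, rhs):
        # the right-hand side is no shorter than the left-hand side
        return len(lhs) <= len(rhs)

    checks = [("regular", is_regular),
              ("context-free", is_context_free),
              ("context-sensitive", is_context_sensitive)]
    for name, check in checks:
        if all(check(lhs, rhs) for lhs, rhs in rules):
            return name
    return "phrase-structure"


G1 = [("S", "aSBC"), ("S", "aBC"), ("CB", "BC"),
      ("aB", "ab"), ("bB", "bb"), ("bC", "bc"), ("cC", "cc")]
G2 = [("S", "aSb"), ("S", "")]
G4 = [("S", "0A"), ("S", "0"), ("A", "1B"), ("B", "0A"), ("B", "0")]

print(classify(G1, {"S", "B", "C"}))   # context-sensitive
print(classify(G2, {"S"}))             # context-free
print(classify(G4, {"S", "A", "B"}))   # regular
```

The classifier looks only at the form of the rules; a language may well have a grammar of a more restrictive type than the one it is given in.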



4.10 Regular Language Recognition 



As explained in Section 4.1, a deterministic finite-state machine (DFSM) $M$ is a five-tuple
$M = (\Sigma, Q, \delta, s, F)$, where $\Sigma$ is the input alphabet, $Q$ is the set of states, $\delta : Q \times \Sigma \mapsto Q$ is
the next-state function, $s$ is the initial state, and $F$ is the set of final states. A nondeterministic
FSM (NFSM) is similarly defined except that $\delta$ is a next-set function $\delta : Q \times \Sigma \mapsto 2^Q$. In
other words, in an NFSM there may be more than one next state for a given state and input.
In Section 4.2 we showed that the languages recognized by these two machine types are the
same.

We now show that the languages $L(G)$ and $L(G) \cup \{\epsilon\}$ defined by regular grammars $G$
are exactly those recognized by FSMs.

THEOREM 4.10.1 The languages $L(G)$ and $L(G) \cup \{\epsilon\}$ generated by regular grammars $G$ and
recognized by finite-state machines are the same.

Proof Given a regular grammar $G$, we construct a corresponding NFSM $M$ that accepts
exactly the strings generated by $G$. Similarly, given a DFSM $M$ we construct a regular
grammar $G$ that generates the strings recognized by $M$.

From a regular grammar $G = (\mathcal{N}, \mathcal{T}, \mathcal{R}, S)$ with rules $\mathcal{R}$ of the form $A \to a$ and
$A \to bC$ we create a grammar $G'$ generating the same language by replacing a rule $A \to a$
with rules $A \to aB$ and $B \to \epsilon$ where $B$ is a new non-terminal unique to $A \to a$. Thus,
every derivation $S \Rightarrow^*_G w$, $w \in \mathcal{T}^*$, now corresponds to a derivation $S \Rightarrow^*_{G'} wB$ where
$B \to \epsilon$. Hence, the strings generated by $G$ and $G'$ are the same.

Now construct an NFSM $M_{G'}$ whose states correspond to the non-terminals of this new
regular grammar and whose input alphabet is its set of terminals. Let the start state of $M_{G'}$
be labeled $S$. Let there be a transition from state $A$ to state $B$ on input $a$ if there is a rule
$A \to aB$ in $G'$. Let a state $B$ be a final state if there is a rule of the form $B \to \epsilon$ in $G'$.
Clearly, every derivation of a string $w$ in $L(G')$ corresponds to a path in $M_{G'}$ that begins in
the start state and ends on a final state. Hence, $w$ is accepted by $M_{G'}$. On the other hand,
if a string $w$ is accepted by $M_{G'}$, given the one-to-one correspondence between edges and
rules, there is a derivation of $w$ from $S$ in $G'$. Thus, the strings generated by $G$ and the
strings accepted by $M_{G'}$ are the same.

Now assume we are given a DFSM $M$ that accepts a language $L_M$. Create a grammar
$G_M$ whose non-terminals are the states of $M$ and whose start symbol is the start state of $M$.
$G_M$ has a rule of the form $q_1 \to a q_2$ if $M$ makes a transition from state $q_1$ to $q_2$ on input
$a$. If state $q$ is a final state of $M$, add the rule $q \to \epsilon$. If a string is accepted by $M$, that is, it
causes $M$ to move to a final state, then $G_M$ generates the same string. Since $G_M$ generates
only strings of this kind, the language accepted by $M$ is $L(G_M)$. Now convert $G_M$ to
a regular grammar $\widehat{G}_M$ by replacing each pair of rules $q_1 \to a q_2$, $q_2 \to \epsilon$ by the pair
$q_1 \to a q_2$, $q_1 \to a$, deleting all rules $q \to \epsilon$ corresponding to unreachable final states $q$,
and deleting the rule $S \to \epsilon$ if $\epsilon \in L_M$. Then, $L_M - \{\epsilon\} = L(G_M) - \{\epsilon\} = L(\widehat{G}_M)$. ■



Figure 4.27 A nondeterministic FSM that accepts a language generated by a regular grammar in
which all rules are of the form $A \to bC$ or $A \to \epsilon$. A state is associated with each non-terminal, the
start symbol $S$ is associated with the start state, and final states are associated with non-terminals
$A$ such that $A \to \epsilon$. This particular NFSM accepts the language $L(G_4)$ of Example 4.9.4.

A simple example illustrates the construction of an NFSM from a regular grammar. Consider
the grammar $G_4$ of Example 4.9.4. A new grammar $G'_4$ is constructed with the following
rules: a) $S \to 0A$, b) $S \to 0C$, c) $C \to \epsilon$, d) $A \to 1B$, e) $B \to 0A$, f) $B \to 0D$, and g) $D \to \epsilon$.
Figure 4.27 shows an NFSM that accepts the language generated by this grammar.
A DFSM recognizing the same language can be obtained by invoking the construction of
Theorem 4.2.1.
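
The construction used in the proof of Theorem 4.10.1 is short enough to sketch in Python. The code below is illustrative: the names `grammar_to_nfsm` and `nfsm_accepts`, the triple encoding of rules, and the generated state names `_new0`, `_new1`, ... are our own assumptions. A rule $A \to a$ is encoded with `None` in place of the missing non-terminal and is given a fresh final state, which mirrors the replacement of $A \to a$ by $A \to aB$, $B \to \epsilon$.

```python
def grammar_to_nfsm(rules, start="S"):
    """rules: list of (A, a, C) for A -> aC, or (A, a, None) for A -> a."""
    transitions = {}      # (state, input letter) -> set of next states
    finals = set()
    for i, (lhs, letter, rhs) in enumerate(rules):
        if rhs is None:
            rhs = f"_new{i}"          # fresh non-terminal B with B -> epsilon
            finals.add(rhs)
        transitions.setdefault((lhs, letter), set()).add(rhs)
    return transitions, start, finals


def nfsm_accepts(nfsm, word):
    """Track the set of states the NFSM can be in after each letter."""
    transitions, start, finals = nfsm
    current = {start}
    for letter in word:
        current = set().union(
            *(transitions.get((q, letter), set()) for q in current))
    return bool(current & finals)


# Grammar G4 of Example 4.9.4.
G4 = [("S", "0", "A"), ("S", "0", None),
      ("A", "1", "B"), ("B", "0", "A"), ("B", "0", None)]
M = grammar_to_nfsm(G4)

print([w for w in ["0", "01", "010", "0100", "01010"] if nfsm_accepts(M, w)])
# ['0', '010', '01010']
```

Tracking the set of reachable states is the same idea as the subset construction that converts an NFSM to a DFSM, carried out on the fly for one input string.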



4.11 Parsing Context-Free Languages 



Parsing is the process of deducing those rules of a grammar G (a derivation) that generates a 
terminal string w. The first rule must have the start symbol S on the left-hand side. In this 
section we give a brief introduction to the parsing of context-free languages, a topic central 
to the parsing of programming languages. The reader is referred to a textbook on compilers 
for more detail on this subject. (See, for example, [11] and [99].) The concepts of Boolean 
matrix multiplication and transitive closure are used in this section, topics that are covered in 
Chapter 6. 

Generally a string $w$ has many derivations. This is illustrated by the context-free grammar
$G_3$ defined in Example 4.9.3 and described below.

EXAMPLE 4.11.1 $G_3 = (\mathcal{N}_3, \mathcal{T}_3, \mathcal{R}_3, S)$, where $\mathcal{N}_3 = \{S, M, N\}$, $\mathcal{T}_3 = \{a, b, c\}$ and $\mathcal{R}_3$
consists of the rules below:

a) $S \to cMNc$      d) $N \to bNb$
b) $M \to aMa$       e) $N \to c$
c) $M \to c$

The string $caacaabcbc$ can be derived by applying rules (a), (b) twice, (c), (d) and (e) to
produce the following derivation:

$S \Rightarrow cMNc \Rightarrow caMaNc \Rightarrow ca^2Ma^2Nc \Rightarrow ca^2ca^2Nc \Rightarrow ca^2ca^2bNbc \Rightarrow ca^2ca^2bcbc$      (4.2)

The same string can be obtained by applying the rules in the following order: (a), (d), (e), 
(b) twice, and (c). Both derivations are described by the parse tree of Fig. 4.28. In this tree 
each instance of a non-terminal is rewritten using one of the rules of the grammar. The order 
of the descendants of a non-terminal vertex in the parse tree is the order of the corresponding 
symbols in the string obtained by replacing this non-terminal. The string $ca^2ca^2bcbc$, the
yield of this parse tree, is the terminal string obtained by visiting the leaves of this tree in a 
left-to-right order. The height of the parse tree is the number of edges on the longest path 
(having the most edges) from the root (associated with the start symbol) to a terminal symbol. 
A parser for a language L(G) is a program or machine that examines a string and produces a 
derivation of the string if it is in the language and an error message if not. 

Because every string generated by a context-free grammar has a derivation, it has a
corresponding parse tree. Given a derivation, it is straightforward to convert it to a leftmost
derivation, a derivation in which the leftmost remaining non-terminal is expanded first. (A
rightmost derivation is a derivation in which the rightmost remaining non-terminal is
expanded first.) Such a derivation can be obtained from the parse tree by deleting all vertices
associated with terminals and then traversing the remaining vertices in a depth-first manner
(visit the first descendant of a vertex before visiting its siblings), assuming that descendants of
a vertex are ordered from left to right. When a vertex is visited, apply the rule associated with
that vertex in the tree. The derivation given in (4.2) is leftmost.

Figure 4.28 A parse tree for the grammar $G_3$.
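
Reading a leftmost derivation off a parse tree can be sketched as follows. The Python code is illustrative only: the tuple encoding of the tree and the names `TREE`, `render` and `leftmost_derivation` are our own, and the tree is the one described in the text for the string $caacaabcbc$ rather than a transcription of Fig. 4.28. At each step the leftmost non-terminal in the current sentential form is replaced by its children.

```python
# A vertex is (non-terminal, [children]); a leaf is a terminal string.
TREE = ("S", ["c",
              ("M", ["a", ("M", ["a", ("M", ["c"]), "a"]), "a"]),
              ("N", ["b", ("N", ["c"]), "b"]),
              "c"])


def render(form):
    """Write a sentential form as a string of symbol names."""
    return "".join(n[0] if isinstance(n, tuple) else n for n in form)


def leftmost_derivation(tree):
    """Return the sentential forms of the leftmost derivation for tree."""
    form = [tree]
    steps = [render(form)]
    while True:
        # Index of the leftmost vertex that is still a non-terminal.
        idx = next((i for i, n in enumerate(form) if isinstance(n, tuple)), None)
        if idx is None:
            return steps
        form[idx:idx + 1] = form[idx][1]   # apply the rule at that vertex
        steps.append(render(form))


print(" => ".join(leftmost_derivation(TREE)))
# S => cMNc => caMaNc => caaMaaNc => caacaaNc => caacaabNbc => caacaabcbc
```

Replacing the search for the leftmost non-terminal with a search for the rightmost one yields the rightmost derivation from the same tree.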

Not only can some strings in a context-free language have multiple derivations, but in 
some languages they have multiple parse trees. Languages containing strings with more than 
one parse tree are said to be ambiguous languages. Otherwise languages are non-ambiguous. 

Given a string that is believed to be generated by a grammar, a compiler attempts to parse 
the string after first scanning the input to identify letters. If the attempt fails, an error message 
is produced. Given a string generated by a context-free grammar, can we guarantee that we can 
always find a derivation or parse tree for that string or determine that none exists? The answer 
is yes, as we now show. 

To demonstrate that every CFL can be parsed, it is convenient first to convert the grammar 
for such a language to Chomsky normal form. 

DEFINITION 4.11.1 A context-free grammar $G$ is in Chomsky normal form if every rule is of
the form $A \to BC$ or $A \to u$, $u \in \mathcal{T}$, except if $\epsilon \in L(G)$, in which case $S \to \epsilon$ is also in the
grammar.

We now give a procedure to convert an arbitrary context-free grammar to Chomsky normal
form.

THEOREM 4.11.1 Every context-free language can be generated by a grammar in Chomsky normal
form.

Proof Let $L = L(G)$ where $G$ is a context-free grammar. We construct a context-free
grammar $G'$ that is in Chomsky normal form. The process described in this proof is illustrated
by the example that follows.

Initially $G'$ is identical with $G$. We begin by eliminating all $\epsilon$-rules of the form $B \to \epsilon$,
except for $S \to \epsilon$ if $\epsilon \in L(G)$. If either $B \to \epsilon$ or $B \Rightarrow^* \epsilon$, for every rule that has $B$ on the
right-hand side, such as $A \to \alpha B \beta B \gamma$, $\alpha, \beta, \gamma \in (V - \{B\})^*$ ($V = \mathcal{N} \cup \mathcal{T}$), we add a rule
for each possible replacement of $B$ by $\epsilon$; for example, we add $A \to \alpha \beta B \gamma$, $A \to \alpha B \beta \gamma$,
and $A \to \alpha \beta \gamma$. Clearly the strings generated by the new rules are the same as are generated
by the old rules.

Let $A \to w_1 \cdots w_i \cdots w_k$ for some $k \ge 1$ be a rule in $G'$ where $w_i \in V$. We replace
this rule with the new rules $A \to Z_1 Z_2 \cdots Z_k$, and $Z_i \to w_i$ for $1 \le i \le k$. Here $Z_i$ is a
new non-terminal. Clearly, the new version of $G'$ generates the same language as does $G$.

With these changes the rules of $G'$ consist of rules either of the form $A \to u$, $u \in \mathcal{T}$
(a single terminal) or $A \to w$, $w \in \mathcal{N}^+$ (a string of at least one non-terminal). There are
two cases of $w \in \mathcal{N}^+$ to consider, a) $|w| = 1$ and b) $|w| \ge 2$. We begin by eliminating all
rules of the first kind, that is of the form $A \to B$.

Rules of the form $A \to B$ can be cascaded to form rules of the type $C \Rightarrow^* D$. The number
of distinct derivations of this kind is at most $|\mathcal{N}|!$ because if any derivation contains two
instances of a non-terminal, the derivation can be shortened. Thus, we need only consider
derivations in which each non-terminal occurs at most once. For each such pair $C$, $D$ with
a relation of this kind, add the rule $C \to D$ to $G'$. If $C \to D$ and $D \to w$ for $|w| \ge 2$ or
$w = u \in \mathcal{T}$, add $C \to w$ to the set of rules. After adding all such rules, delete all rules of
the form $A \to B$. By construction this new set of rules generates the same language as the
original set of rules but eliminates all rules of the first kind.

We now replace rules of the type $A \to A_1 A_2 \cdots A_k$, $k \ge 3$. Introduce $k - 2$ new
non-terminals $N_1, N_2, \ldots, N_{k-2}$ peculiar to this rule and replace the rule with the following
rules: $A \to A_1 N_1$, $N_1 \to A_2 N_2$, $\ldots$, $N_{k-3} \to A_{k-2} N_{k-2}$, $N_{k-2} \to A_{k-1} A_k$. Clearly, the
new grammar generates the same language as the original grammar and is in the Chomsky
normal form. ■
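
The last step of the proof, splitting long right-hand sides, can be sketched in Python. This is an illustration of that one step only, not of the whole conversion: the function name `binarize` and the generated names such as `N0_1` are our own, and the sketch assumes these names do not clash with non-terminals already in the grammar. A rule with $k \ge 3$ symbols on the right receives $k - 2$ fresh non-terminals.

```python
def binarize(rules):
    """rules: list of (lhs, rhs) with rhs a tuple of non-terminal names."""
    out = []
    for n, (lhs, rhs) in enumerate(rules):
        k = len(rhs)
        if k <= 2:
            out.append((lhs, rhs))        # already short enough
            continue
        # k - 2 fresh non-terminals peculiar to rule number n.
        fresh = [f"N{n}_{i}" for i in range(1, k - 1)]
        heads = [lhs] + fresh
        for i in range(k - 2):
            # A -> A1 N1, N1 -> A2 N2, ..., N(k-3) -> A(k-2) N(k-2)
            out.append((heads[i], (rhs[i], fresh[i])))
        # N(k-2) -> A(k-1) Ak
        out.append((fresh[-1], (rhs[-2], rhs[-1])))
    return out


print(binarize([("A", ("B", "C", "D", "E"))]))
# [('A', ('B', 'N0_1')), ('N0_1', ('C', 'N0_2')), ('N0_2', ('D', 'E'))]
```

A rule of length $k$ thus becomes $k - 1$ rules, each with exactly two non-terminals on its right-hand side.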

EXAMPLE 4.11.2 Let $G_5 = (\mathcal{N}_5, \mathcal{T}_5, \mathcal{R}_5, E)$ (with start symbol $E$) be the grammar with
$\mathcal{N}_5 = \{E, T, F\}$, $\mathcal{T}_5 = \{a, b, +, *, (, )\}$, and $\mathcal{R}_5$ consisting of the rules given below:

a) $E \to E + T$      d) $T \to F$        f) $F \to a$
b) $E \to T$          e) $F \to (E)$      g) $F \to b$
c) $T \to T * F$

Here $E$, $T$, and $F$ denote expressions, terms, and factors. It is straightforward to show that
$E \Rightarrow^* (a * b + a) * (a + b)$ and $E \Rightarrow^* a * b + a$ are two possible derivations.

We convert this grammar to the Chomsky normal form using the method described in the
proof of Theorem 4.11.1. Since $\mathcal{R}_5$ contains no $\epsilon$-rules, we do not need the rule $E \to \epsilon$, nor
do we need to eliminate $\epsilon$-rules.

First we convert rules of the form $A \to w$ so that each entry in $w$ is a non-terminal. To
do this we introduce the non-terminals $\mathbf{(}$, $\mathbf{)}$, $\mathbf{+}$, and $\mathbf{*}$ and the rules below. Here we use a
boldface font to distinguish between the non-terminal and terminal equivalents of these four
mathematical symbols. Since we are adding to the original set of rules, we number them
consecutively with the original rules.

h) $\mathbf{(} \to ($      j) $\mathbf{+} \to +$
i) $\mathbf{)} \to )$      k) $\mathbf{*} \to *$

Next we add rules of the form $C \to D$ for all chains of single non-terminals such that
$C \Rightarrow^* D$. Since by inspection $E \Rightarrow^* F$, we add the rule $E \to F$. For every rule of the form $A \to B$
for which $B \to w$, we add the rule $A \to w$. We then delete all rules of the form $A \to B$. These
changes cause the rules of $G'$ to become the following. (Below we use a different numbering
scheme because all these rules replace rules (a) through (k).)

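1) $E \to E \mathbf{+} T$       7) $T \to \mathbf{(} E \mathbf{)}$      13) $\mathbf{(} \to ($
2) $E \to T \mathbf{*} F$       8) $T \to a$          14) $\mathbf{)} \to )$
3) $E \to \mathbf{(} E \mathbf{)}$       9) $T \to b$          15) $\mathbf{+} \to +$
4) $E \to a$          10) $F \to \mathbf{(} E \mathbf{)}$     16) $\mathbf{*} \to *$
5) $E \to b$          11) $F \to a$
6) $T \to T \mathbf{*} F$      12) $F \to b$
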

We now reduce the number of non-terminals on the right-hand side of each rule to two 
through the addition of new non-terminals. The result is shown in Example 4.11.3 below, 
where we have added the non-terminals A, B, C, D, G, and H. 

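EXAMPLE 4.11.3 Let $G_6 = (\mathcal{N}_6, \mathcal{T}_6, \mathcal{R}_6, E)$ (with start symbol $E$) be the grammar with
$\mathcal{N}_6 = \{A, B, C, D, E, F, G, H, T, \mathbf{+}, \mathbf{*}, \mathbf{(}, \mathbf{)}\}$, $\mathcal{T}_6 = \{a, b, +, *, (, )\}$, and $\mathcal{R}_6$ consisting of
the rules given below.

(A) $E \to EA$        (I) $T \to TD$        (Q) $H \to E\mathbf{)}$
(B) $A \to \mathbf{+}T$        (J) $D \to \mathbf{*}F$        (R) $F \to a$
(C) $E \to TB$        (K) $T \to \mathbf{(}G$        (S) $F \to b$
(D) $B \to \mathbf{*}F$        (L) $G \to E\mathbf{)}$        (T) $\mathbf{(} \to ($
(E) $E \to \mathbf{(}C$        (M) $T \to a$         (U) $\mathbf{)} \to )$
(F) $C \to E\mathbf{)}$        (N) $T \to b$         (V) $\mathbf{+} \to +$
(G) $E \to a$         (P) $F \to \mathbf{(}H$        (W) $\mathbf{*} \to *$
(H) $E \to b$
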

The new grammar clearly generates the same language as does the original grammar, but it 
is in Chomsky normal form. It has 22 rules, 13 non-terminals, and six terminals whereas the 
original grammar had seven rules, three non-terminals, and six terminals. 

We now use the Chomsky normal form to show that for every CFL there is a polynomial- 
time algorithm that tests for membership of a string in the language. This algorithm can be 
practical for some languages. 

THEOREM 4.11.2 Given a context-free grammar $G = (\mathcal{N}, \mathcal{T}, \mathcal{R}, S)$, an $O(n^3 |\mathcal{N}|^2)$-step algorithm
exists to determine whether or not a string $w \in \mathcal{T}^*$ of length $n$ is in $L(G)$ and to construct
a parse tree for it if it exists.

Proof If $G$ is not in Chomsky normal form, convert it to this form. Given a string $w =
(w_1, w_2, \ldots, w_n)$, the goal is to determine whether or not $S \Rightarrow^* w$. Let $\emptyset$ denote the empty
set. The approach taken is to construct an $(n + 1) \times (n + 1)$ set matrix $S$ whose entries
are sets of non-terminals of $G$ with the property that the $i, j$ entry, $a_{i,j}$, is the set of
non-terminals $C$ such that $C \Rightarrow^* w_i \cdots w_{j-1}$. Thus, the string $w$ is in $L(G)$ if $S \in a_{1,n+1}$, since
$S$ generates the entire string $w$. Clearly, $a_{i,j} = \emptyset$ for $j \le i$. We illustrate this construction
with the example following this proof.

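We show by induction that set matrix $S$ is the transitive closure (denoted $B^+$) of the
$(n + 1) \times (n + 1)$ set matrix $B$ whose $i, j$ entry $b_{i,j} = \emptyset$ for $j \ne i + 1$ when $1 \le i \le n$
and $b_{i,i+1}$ is defined as follows:

$$b_{i,i+1} = \{A \mid (A \to w_i) \text{ in } \mathcal{R} \text{ where } w_i \in \mathcal{T}\}$$

$$B = \begin{bmatrix}
\emptyset & b_{1,2} & \emptyset & \cdots & \emptyset \\
\emptyset & \emptyset & b_{2,3} & \cdots & \emptyset \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\emptyset & \emptyset & \emptyset & \cdots & b_{n,n+1} \\
\emptyset & \emptyset & \emptyset & \cdots & \emptyset
\end{bmatrix}$$

Thus, the entry $b_{i,i+1}$ is the set of non-terminals that generate the $i$th terminal symbol $w_i$
of $w$ in one step. The value of each entry in the matrix $B$ is the empty set except for the
entries $b_{i,i+1}$ for $1 \le i \le n$, $n = |w|$.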

We extend the concept of matrix multiplication (see Chapter 6) to the product of two
set matrices. Doing this requires a new definition for the product of two sets (entries in the
matrix) as well as for the addition of two sets. The product $S_1 \cdot S_2$ of sets of non-terminals
$S_1$ and $S_2$ is defined as:

$$S_1 \cdot S_2 = \{A \mid \text{there exists } B \in S_1 \text{ and } C \in S_2 \text{ such that } (A \to BC) \in \mathcal{R}\}$$

Thus, $S_1 \cdot S_2$ is the set of non-terminals for which there is a rule in $\mathcal{R}$ of the form $A \to BC$
where $B \in S_1$ and $C \in S_2$. The sum of two sets is their union.

The $i, j$ entry of the product $C = D \times E$ of two $m \times m$ matrices $D$ and $E$, each
containing sets of non-terminals, is defined below in terms of the product and union of sets:

$$c_{i,j} = \bigcup_{k=1}^{m} d_{i,k} \cdot e_{k,j}$$


We also define the transitive closure $C^+$ of an $m \times m$ matrix $C$ as follows:

$$C^+ = C^{(1)} \cup C^{(2)} \cup C^{(3)} \cup \cdots \cup C^{(m)}$$

where

$$C^{(s)} = \bigcup_{r=1}^{s-1} C^{(r)} \times C^{(s-r)} \quad \text{and} \quad C^{(1)} = C$$


By the definition of the matrix product, the entry $b^{(2)}_{i,j}$ of the matrix $B^{(2)}$ is $\emptyset$ if $j \ne i + 2$
and otherwise is the set of non-terminals $A$ that produce $w_i w_{i+1}$ through a derivation tree
of depth 2; that is, there are rules such that $A \to BC$, $B \to w_i$, and $C \to w_{i+1}$, which
implies that $A \Rightarrow^* w_i w_{i+1}$.

Similarly, it follows that both $B^{(1)} B^{(2)}$ and $B^{(2)} B^{(1)}$ are $\emptyset$ in all positions except $i, i + 3$
for $1 \le i \le n - 2$. The entry in position $i, i + 3$ of $B^{(3)} = B^{(1)} B^{(2)} \cup B^{(2)} B^{(1)}$
contains the set of non-terminals $A$ that produce $w_i w_{i+1} w_{i+2}$ through a derivation tree of
depth 3; that is, $A \to BC$ and either $B$ produces $w_i w_{i+1}$ through a derivation of depth 2
($B \Rightarrow^* w_i w_{i+1}$) and $C$ produces $w_{i+2}$ in one step ($C \to w_{i+2}$) or $B$ produces $w_i$ in one step
($B \to w_i$) and $C$ produces $w_{i+1} w_{i+2}$ through a derivation of depth 2 ($C \Rightarrow^* w_{i+1} w_{i+2}$).

Finally, the only entry in $B^{(n)}$ that is not $\emptyset$ is the $1, n + 1$ entry and it contains the set
of non-terminals, if any, that generate $w$. If $S$ is in this set, $w$ is in $L(G)$.

The transitive closure $S = B^+$ involves $\sum_{r=1}^{n} r = (n+1)n/2$ products of set matrices.
The product of two $(n + 1) \times (n + 1)$ set matrices of the type considered here involves at
most $n$ products of sets. Thus, at most $O(n^3)$ products of sets are needed to form $S$. In turn,
a product of two sets, $S_1 \cdot S_2$, can be formed with $O(q^2)$ operations, where $q = |\mathcal{N}|$ is the
number of non-terminals. It suffices to compare each pair of entries, one from $S_1$ and the
other from $S_2$, through a table to determine if they form the right-hand side of a rule.

As the matrices are being constructed, if a pair of non-terminals is discovered that is the
right-hand side of a rule, that is, $A \to BC$, then a link can be made from the entry $A$ in the
product matrix to the entries $B$ and $C$. From the entry $S$ in $a_{1,n+1}$, if it exists, links can be
followed to generate a parse tree for the input string. ■
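The construction in this proof can be sketched in Python for the grammar $G_6$ of Example 4.11.3. The sketch below is illustrative only: the names `LP`, `RP`, `PLUS`, and `STAR` (standing for the boldface non-terminals), the dictionary layout, and the function names are assumptions made here. It computes the matrices $B^{(s)}$ but visits only the entries $(i, i+s)$ that can be non-empty, and it does not record the links needed to build a parse tree.

```python
from itertools import product

# Rules of G6. LP, RP, PLUS, STAR are hypothetical names for the boldface
# non-terminals (, ), +, * of Example 4.11.3.
TERMINAL_RULES = {            # terminal -> non-terminals A with A -> terminal
    "a": {"E", "T", "F"}, "b": {"E", "T", "F"},
    "(": {"LP"}, ")": {"RP"}, "+": {"PLUS"}, "*": {"STAR"},
}
BINARY_RULES = {              # (B, C) -> non-terminals A with A -> BC
    ("E", "A"): {"E"}, ("PLUS", "T"): {"A"}, ("T", "B"): {"E"},
    ("STAR", "F"): {"B", "D"}, ("LP", "C"): {"E"},
    ("E", "RP"): {"C", "G", "H"}, ("T", "D"): {"T"},
    ("LP", "G"): {"T"}, ("LP", "H"): {"F"},
}


def set_product(s1, s2):
    """S1 . S2: all A with a rule A -> BC, B in S1 and C in S2."""
    out = set()
    for b, c in product(s1, s2):
        out |= BINARY_RULES.get((b, c), set())
    return out


def closure_layers(w):
    """Return layers[s][i][j], entry (i, j) of B^(s), with 0-based indices."""
    n = len(w)

    def empty():
        return [[set() for _ in range(n + 1)] for _ in range(n + 1)]

    layers = {1: empty()}
    for i, ch in enumerate(w):
        layers[1][i][i + 1] = set(TERMINAL_RULES.get(ch, set()))
    for s in range(2, n + 1):
        layers[s] = empty()
        for r in range(1, s):              # B^(s) = union over r of B^(r) x B^(s-r)
            for i in range(n + 1 - s):     # only entry (i, i + s) can be non-empty
                k = i + r
                layers[s][i][i + s] |= set_product(
                    layers[r][i][k], layers[s - r][k][i + s])
    return layers


def accepts(w, start="E"):
    n = len(w)
    return n > 0 and start in closure_layers(w)[n][0][n]


if __name__ == "__main__":
    print(accepts("a*b+a"), accepts("a*b+"))
```

On the string `a*b+a` the sketch places $E$ in the corner entry of $B^{(5)}$ and leaves every entry of $B^{(4)}$ empty.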

The procedure described in this proof can be extended to show that membership in an 
arbitrary CFL can be determined in time $O(M(n))$, where $M(n)$ is the number of operations
to multiply two n x n matrices [342]. This is the fastest known general algorithm for this 
problem when the grammar is part of the input. For some CFLs, faster algorithms are known 
that are based on the use of the deterministic pushdown automaton. For fixed grammars 
membership algorithms often run in O(n) steps. The reader is referred to books on compilers 
for such results. The procedure of the proof is illustrated by the following example. 

EXAMPLE 4.11.4 Consider the grammar $G_6$ of Example 4.11.3. We show how the five-character
string $a * b + a$ in $L(G_6)$ can be parsed. We construct the $6 \times 6$ matrices $B^{(1)}$, $B^{(2)}$, $B^{(3)}$, $B^{(4)}$,
$B^{(5)}$, as shown below. Since $B^{(5)}$ contains $E$ in the $1, n + 1$ position, $a * b + a$ is in the language.
Furthermore, we can follow links between non-terminals (not shown) to demonstrate that this string
has the parse tree shown in Fig. 4.29. The matrix $B^{(4)}$ is not shown because each of its entries is $\emptyset$.

Figure 4.29 The parse tree for the string $a * b + a$ in the language $L(G_6)$.

[Matrices $B^{(1)}$, $B^{(2)}$, $B^{(3)}$, and $B^{(5)}$; the $1, 6$ entry of $B^{(5)}$ is $\{E\}$.]


4.12 CFL Acceptance with Pushdown Automata* 

While it is now clear that an algorithm exists to parse every context-free language, it is useful 
to show that there is a class of automata that accepts exactly the context-free languages. These 
are the nondeterministic pushdown automata (PDA) described in Section 4.8. 

We now establish the principal results of this section, namely, that the context-free lan- 
guages are accepted by PDAs and that the languages accepted by PDAs are context-free. We 
begin with the first result. 

THEOREM 4.12.1 For each context-free grammar $G$ there is a PDA $M$ that accepts $L(G)$. That
is, $L(M) = L(G)$.

Proof Before beginning this proof, we extend the definition of a PDA to allow it to push
strings onto the stack instead of just symbols. That is, we extend the stack alphabet $\Gamma$ to
include a small set of strings. When a string such as $abcd$ is pushed, $a$ is pushed before $b$, $b$
before $c$, etc. This does not increase the power of the PDA, because for each string we can
add unique states that $M$ enters after pushing each symbol except the last. With the pushing
of the last symbol $M$ enters the successor state specified in the transition being executed.

Let $G = (\mathcal{N}, \mathcal{T}, \mathcal{R}, S)$ be a context-free grammar. We construct a PDA $M = (\Sigma, \Gamma, Q,
\Delta, s, F)$, where $\Sigma = \mathcal{T}$, $\Gamma = \mathcal{N} \cup \mathcal{T} \cup \{\gamma\}$ ($\gamma$ is the blank stack symbol), $Q = \{s, p, f\}$,
$F = \{f\}$, and $\Delta$ consists of transitions of the types shown below. Here $\forall$ denotes "for all"
and $\forall (A \to v) \in \mathcal{R}$ means for all rules in $\mathcal{R}$.

a) $(s, \epsilon, \epsilon; p, S)$

b) $(p, a, a; p, \epsilon)$   $\forall a \in \mathcal{T}$

c) $(p, \epsilon, A; p, v)$   $\forall (A \to v) \in \mathcal{R}$

d) $(p, \epsilon, \gamma; f, \epsilon)$

Let $w$ be placed left-adjusted on the input tape of $M$. Since $w$ is generated by $G$, it has
a leftmost derivation. (Consider for example that given in (4.2) on page 186.) The PDA
begins by pushing the start symbol $S$ onto the stack and entering state $p$ (Rule (a)). From
this point on the PDA simulates a leftmost derivation of the string $w$ placed initially on its
tape. (See the example that follows this proof.) $M$ either matches a terminal of $G$ on the top
of the stack with one under the tape head (Rule (b)) or it replaces a non-terminal on the top
of the stack with a rule of $\mathcal{R}$ by pushing the right-hand side of the rule onto the stack (Rule
(c)). Finally, when the stack is empty, $M$ can choose to enter the final state $f$ and accept $w$.
It follows that any string that can be generated by $G$ can also be accepted by $M$ and vice
versa. ■

The leftmost derivation of the string $caacaabcbc$ by the grammar $G_3$ of Example 4.11.1
is shown in (4.2). The PDA $M$ of the above proof can simulate this derivation, as we show.
With the notation T : ... and S : ... (shown below before the computation begins) we
denote the contents of the tape and stack at a point in time at which the underlined symbols
are those under the tape head and at the top of the stack, respectively. We ignore the blank
tape and stack symbols unless they are the ones underlined.

T : caacaabcbc    S : γ

After the first step taken by $M$, the tape and stack configurations are:

T : caacaabcbc    S : S

From this point on $M$ simulates a derivation by $G_3$. Consulting (4.2), we see that the rule
$S \to cMNc$ is the first to be applied. $M$ simulates this with the transition $(p, \epsilon, S; p, cMNc)$,
which causes $S$ to be popped from the stack and $cMNc$ to be pushed onto it without advancing
the tape head. The resulting configurations are shown below:

T : caacaabcbc    S : cMNc

Next the transition $(p, c, c; p, \epsilon)$ is applied to pop one item from the stack, exposing the
non-terminal $M$ and advancing the tape head to give the following configurations:

T : caacaabcbc    S : MNc

The subsequent rules, in order, are the following:

1) $M \to aMa$      3) $M \to c$        5) $N \to c$

2) $M \to aMa$      4) $N \to bNb$

The corresponding transitions of the PDA are shown in Fig. 4.30. 
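The behavior of this PDA can be sketched in Python as a search over its nondeterministic choices. The sketch uses only the rules of $G_3$ that appear in the derivation above ($S \to cMNc$, $M \to aMa$, $M \to c$, $N \to bNb$, $N \to c$). The function name and the pruning test are assumptions of the sketch, and the states $s$, $p$, $f$ and the blank stack symbol are not modeled explicitly: acceptance is taken to mean that the stack is empty when the whole input has been read.

```python
# Rules of G3 as used in the derivation of caacaabcbc.
RULES = {"S": ["cMNc"], "M": ["aMa", "c"], "N": ["bNb", "c"]}


def pda_accepts(w, start="S"):
    """Depth-first search over the PDA's nondeterministic choices.

    A configuration is (tape position, stack), with the stack written as a
    string whose first character is the top of the stack.
    """
    todo = [(0, start)]
    seen = set()
    while todo:
        pos, stack = todo.pop()
        if (pos, stack) in seen:
            continue
        seen.add((pos, stack))
        if not stack:
            if pos == len(w):
                return True            # stack empty and input consumed
            continue
        # Pruning, valid only because G3 has no epsilon-rules: every stack
        # symbol must still account for at least one unread input symbol.
        if len(stack) > len(w) - pos:
            continue
        top, rest = stack[0], stack[1:]
        if top in RULES:
            # Rule (c): replace the non-terminal on top by a right-hand side.
            for body in RULES[top]:
                todo.append((pos, body + rest))
        elif pos < len(w) and w[pos] == top:
            # Rule (b): match a terminal with the symbol under the tape head.
            todo.append((pos + 1, rest))
    return False


if __name__ == "__main__":
    print(pda_accepts("caacaabcbc"))
```

The search explores both choices for $M$ and $N$ at each expansion, which mirrors the nondeterminism of the PDA; a real PDA simply "guesses" the right rule.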

We now show that the language accepted by a PDA can be generated by a context-free 
grammar. 
Figure 4.30 PDA transitions corresponding to the leftmost derivation of the string $caacaabcbc$
in the grammar $G_3$ of Example 4.11.1.



THEOREM 4.12.2 For each PDA $M$ there is a context-free grammar $G$ that generates the language
$L(M)$ accepted by $M$. That is, $L(G) = L(M)$.

Proof It is convenient to assume that when the PDA $M$ accepts a string it does so with
an empty stack. If $M$ is not of this type, we can design a PDA $M'$ accepting the same
language that does meet this condition. The states of $M'$ consist of the states of $M$ plus
three additional states, a new initial state $s'$, a cleanup state $k$, and a new final state $f'$. Its
tape symbols are identical to those of $M$. Its stack symbols consist of those of $M$ plus one
new symbol $\kappa$. In its initial state $M'$ pushes $\kappa$ onto the stack without reading a tape symbol
and enters state $s$, which was the initial state of $M$. It then operates as $M$ (it has the same
transitions) until entering a final state of $M$, upon which it enters the cleanup state $k$. In
this state it pops the stack until it finds the symbol $\kappa$, at which time it enters its final state
$f'$. Clearly, $M'$ accepts the same language as $M$ but leaves its stack empty.

We describe a context-free grammar $G = (\mathcal{N}, \mathcal{T}, \mathcal{R}, S)$ with the property that $L(G) =
L(M)$. The non-terminals of $G$ consist of $S$ and the triples $\langle p, y, q \rangle$ defined below
denoting goals:

$$\langle p, y, q \rangle \in \mathcal{N} \text{ where } \mathcal{N} \subset Q \times (\Gamma \cup \{\epsilon\}) \times Q$$

The meaning of $\langle p, y, q \rangle$ is that $M$ moves from state $p$ to state $q$ in a series of steps
during which its only effect on the stack is to pop $y$. The triple $\langle p, \epsilon, q \rangle$ denotes the goal
of moving from state $p$ to state $q$ leaving the stack in its original condition. Since $M$ starts
with an empty stack in state $s$ with a string $w$ on its tape and ends in a final state $f$ with
its stack empty, the non-terminal $\langle s, \epsilon, f \rangle$, $f \in F$, denotes the goal of $M$ moving from
state $s$ to a final state $f$ on input $w$, and leaving the stack in its original state.
The rules of $G$, which represent goal refinement, are described by the following
conditions. Each condition specifies a family of rules for a context-free grammar $G$. Each
rule either replaces one non-terminal with another, replaces a non-terminal with the empty
string, or rewrites a non-terminal with a terminal or empty string followed by one or two
non-terminals. The result of applying a sequence of rules is a string of terminals in the
language $L(G)$. Below we show that $L(G) = L(M)$.
Condition (1) specifies rules that map the start symbol of $G$ onto the goal non-terminal
symbol $\langle s, \epsilon, f \rangle$ for each final state $f$. These rules ensure that the start symbol of $G$ is
rewritten as the goal of moving from the initial state of $M$ to a final state, leaving the stack
in its original condition.

Condition (2) specifies rules that map non-terminals $\langle p, \epsilon, p \rangle$ onto the empty string.
Thus, all goals of moving from a state to itself leaving the stack in its original condition can
be ignored. In other words, no input is needed to take $M$ from state $p$ back to itself leaving
the stack unchanged.

Condition (3) specifies rules stating that for all $r \in Q$ and $(p, x, y; q, z)$, $y \ne \epsilon$, that are
transitions of $M$, a goal $\langle p, y, r \rangle$ to move from state $p$ to state $r$ while removing $y$ from
the stack can be accomplished by reading tape symbol $x$, replacing the top stack symbol
$y$ with $z$, and then realizing the goal $\langle q, z, r \rangle$ of moving from state $q$ to state $r$ while
removing $z$ from the stack.

Condition (4) specifies rules stating that for all $r, t \in Q$ and $(p, x, \epsilon; q, z)$ that are
transitions of $M$, the goal $\langle p, u, r \rangle$ of moving from state $p$ to state $r$ while popping $u$
for arbitrary stack symbol $u$ can be achieved by reading input $x$ and pushing $z$ on top of $u$
and then realizing the goal $\langle q, z, t \rangle$ of moving from $q$ to some state $t$ while popping $z$
followed by the goal $\langle t, u, r \rangle$ of moving from $t$ to $r$ while popping $u$.
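Conditions (1) through (4) translate almost directly into a procedure that lists the rules of $G$. The Python sketch below is a minimal illustration under assumptions made here: transitions are encoded as tuples `(p, x, y, q, z)`, the empty string stands for $\epsilon$, goals are tuples `(p, y, q)`, and the small PDA used as an example (it reads some $a$'s while pushing `X`, then matches them against $b$'s) is hypothetical.

```python
from itertools import product

EPS = ""  # the empty string; also used to mean "no stack symbol"


def pda_to_cfg(states, stack_symbols, transitions, start, finals):
    """Return the rules of conditions (1)-(4) as (head, body) pairs.

    A transition (p, x, y, q, z) means: in state p, read x (or EPS),
    pop y (or EPS), enter state q, and push z (or EPS).
    A goal non-terminal is a triple (p, y, q); "S" is the start symbol.
    """
    rules = []
    for f in finals:                                   # condition (1)
        rules.append(("S", [(start, EPS, f)]))
    for p in states:                                   # condition (2)
        rules.append(((p, EPS, p), []))
    for (p, x, y, q, z) in transitions:
        if y != EPS:                                   # condition (3)
            for r in states:
                rules.append(((p, y, r), [x, (q, z, r)]))
        else:                                          # condition (4)
            for r, t in product(states, repeat=2):
                for u in list(stack_symbols) + [EPS]:
                    rules.append(((p, u, r), [x, (q, z, t), (t, u, r)]))
    return rules


# A hypothetical two-state PDA: push X per a, then pop X per b.
STATES = ["s", "q"]
TRANSITIONS = [
    ("s", "a", EPS, "s", "X"),
    ("s", "b", "X", "q", EPS),
    ("q", "b", "X", "q", EPS),
]
RULES = pda_to_cfg(STATES, ["X"], TRANSITIONS, "s", ["q"])

if __name__ == "__main__":
    for head, body in RULES:
        print(head, "->", body)
```

For this PDA the string $ab$ has the derivation $S \to \langle s, \epsilon, q \rangle \to a \langle s, X, q \rangle \langle q, \epsilon, q \rangle \to a\, b \langle q, \epsilon, q \rangle \langle q, \epsilon, q \rangle \to ab$, which uses one rule from each of the four conditions.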

We now show that any string accepted by $M$ can be generated by $G$ and any string
generated by $G$ can be accepted by $M$. It follows that $L(M) = L(G)$. Instead of showing
this directly, we establish a more general result.

CLAIM: For all $r, t \in Q$ and $u \in \Gamma \cup \{\epsilon\}$, $\langle r, u, t \rangle \Rightarrow^*_G w$ if and only if the PDA $M$
can move from state $r$ to state $t$ while reading $w$ and popping $u$ from the stack.

The theorem follows from the claim because $\langle s, \epsilon, f \rangle \Rightarrow^*_G w$ if and only if the PDA
$M$ can move from initial state $s$ to a final state $f$ while reading $w$ and leaving the stack
empty, that is, if and only if $M$ accepts $w$.

We first establish the "if" portion of the claim, namely, if for $r, t \in Q$ and $u \in \Gamma \cup \{\epsilon\}$
the PDA $M$ can move from $r$ to $t$ while reading $w$ and popping $u$ from the stack, then
$\langle r, u, t \rangle \Rightarrow^*_G w$. The proof is by induction on the number of steps taken by $M$. If no
step is taken (basis for induction), $r = t$, nothing is popped and the string $\epsilon$ is read by $M$.
Since the grammar $G$ contains the rule $\langle r, \epsilon, r \rangle \to \epsilon$, the basis is established.

Suppose that the "if" portion of the claim is true for $k$ or fewer steps (inductive
hypothesis). We show that it is true for $k + 1$ steps (induction step). If the PDA $M$ can move
from $r$ to $t$ in $k + 1$ steps while reading $w = xv$ and removing $u$ from the stack, then on
its first step it must execute a transition $(r, x, y; q, z)$, $q \in Q$, $z \in \Gamma \cup \{\epsilon\}$, for $x \in \Sigma$ with
either $y = u$ if $u \ne \epsilon$ or $y = \epsilon$. In the first case, $M$ enters state $q$, pops $u$, and pushes
$z$. $M$ subsequently pops $z$ as it reads $v$ and moves to state $t$ in $k$ steps. It follows from the
inductive hypothesis that $\langle q, z, t \rangle \Rightarrow^*_G v$. Since $y \ne \epsilon$, a rule of type (3) applies, that is,
$\langle r, y, t \rangle \to x \langle q, z, t \rangle$. It follows that $\langle r, y, t \rangle \Rightarrow^*_G w$, the desired conclusion.

In the second case $y = \epsilon$ and $M$ makes the transition $(r, x, \epsilon; q, z)$ by moving from $r$ to
$q$ and pushing $z$ while reading $x$. To pop $u$, which must have been at the top of the stack, $M$
must first pop $z$ and then pop $u$. Let it pop $z$ as it moves from $q$ to some intermediate state
$t'$ while reading a first portion $v_1$ of the input word $v$. Let it pop $u$ as it moves from $t'$ to $t$
while reading a second portion $v_2$ of the input word $v$. Here $v_1 v_2 = v$. Since the move from
$q$ to $t'$ and from $t'$ to $t$ each involves at most $k$ steps, it follows that the goals $\langle q, z, t' \rangle$
and $\langle t', u, t \rangle$ satisfy $\langle q, z, t' \rangle \Rightarrow^*_G v_1$ and $\langle t', u, t \rangle \Rightarrow^*_G v_2$. Because $M$'s first
transition meets condition (4), there is a rule $\langle r, u, t \rangle \to x \langle q, z, t' \rangle \langle t', u, t \rangle$.
Combining these derivations yields the desired conclusion.

Now we establish the "only if" part of the claim, namely, if for all $r, t \in Q$ and $u \in
\Gamma \cup \{\epsilon\}$, $\langle r, u, t \rangle \Rightarrow^*_G w$ then the PDA $M$ can move from state $r$ to state $t$ while
reading $w$ and removing $u$ from the stack. Again the proof is by induction, this time on
the number of derivation steps. If there is a single derivation step (basis for induction),
it must be of the type stated in condition (2), namely $\langle p, \epsilon, p \rangle \to \epsilon$. Since $M$ can
move from state $p$ to $p$ without reading the tape or pushing data onto its stack, the basis is
established.

Suppose that the "only if" portion of the claim is true for $k$ or fewer derivation steps
(inductive hypothesis). We show that it is true for $k + 1$ steps (induction step). That is,
if $\langle r, u, t \rangle \Rightarrow^*_G w$ in $k + 1$ steps, then we show that $M$ can move from $r$ to $t$ while
reading $w$ and popping $u$ from the stack. We can assume that the first derivation step is of
type (3) or (4) because if it is of type (2), the derivation can be shortened and the result
follows from the inductive hypothesis. If the first derivation is of type (3), namely, of the form
$\langle r, u, t \rangle \to x \langle q, z, t \rangle$, then by the inductive hypothesis, $M$ can execute $(r, x, u; q, z)$,
$u \ne \epsilon$, that is, read $x$, pop $u$, push $z$, and enter state $q$. Since $\langle r, u, t \rangle \Rightarrow^*_G w$, where
$w = xv$, it follows that $\langle q, z, t \rangle \Rightarrow^*_G v$. Again by the inductive hypothesis $M$ can move
from $q$ to $t$ while reading $v$ and popping $z$. Combining these results, we have the desired
conclusion.

If the first derivation is of type (4), namely, $\langle r, u, t \rangle \to x \langle q, z, t' \rangle \langle t', u, t \rangle$,
then the two non-terminals $\langle q, z, t' \rangle$ and $\langle t', u, t \rangle$ must expand to substrings $v_1$
and $v_2$, respectively, of $v$ where $w = x v_1 v_2 = xv$. That is, $\langle q, z, t' \rangle \Rightarrow^*_G v_1$ and
$\langle t', u, t \rangle \Rightarrow^*_G v_2$. By the inductive hypothesis, $M$ can move from $q$ to $t'$ while
reading $v_1$ and popping $z$ and it can also move from $t'$ to $t$ while reading $v_2$ and popping
$u$. Thus, $M$ can move from $r$ to $t$ while reading $w$ and popping $u$, which is the desired
conclusion. ■



©John E Savage 



4.13 Properties of Context-Free Languages 

In this section we derive properties of context-free languages. We begin by establishing a 
pumping lemma that demonstrates that every CFL has a certain periodicity property. This 
property, together with other properties concerning the closure of the class of CFLs under the 
operations of concatenation, union and intersection, is used to show that the class is not closed 
under complementation and intersection. 

4.13.1 CFL Pumping Lemma 

The pumping lemma for regular languages established in Section 4.5 showed that if a regular 
language contains an infinite number of strings, then it must have strings of a particular form. 
This lemma was used to show that some languages are not regular. We establish a similar result 
for context-free languages. 

LEMMA 4.13.1 Let $G = (\mathcal{N}, \mathcal{T}, \mathcal{R}, S)$ be a context-free grammar in Chomsky normal form
with $m$ non-terminals. Then, if $w \in L(G)$ and $|w| \ge 2^{m-1} + 1$, there are strings $r$, $s$, $t$,
$u$, and $v$ with $w = rstuv$ such that $|su| \ge 1$ and $|stu| \le 2^m$ and for all integers $n \ge 0$,
$S \Rightarrow^*_G r s^n t u^n v \in L(G)$.

Proof Since each production is of the form $A \to BC$ or $A \to a$, a subtree of a parse tree of
height $h$ has a yield (number of leaves) of at most $2^{h-1}$. To see this, observe that each rule
that generates a leaf is of the form $A \to a$. Thus, the yield is the number of leaves in a binary
tree of height $h - 1$, which is at most $2^{h-1}$.

Let $K = 2^{m-1} + 1$. If there is a string $w$ in $L$ of length $K$ or greater, its parse tree
has height greater than $m$. Thus, a longest path $P$ in such a tree (see Fig. 4.31(a)) has more
than $m$ non-terminals on it. Consider the subpath $SP$ of $P$ containing the last $m + 1$
non-terminals of $P$. Let $D$ be the first non-terminal on $SP$ and let the yield of its parse tree
be $y$. It follows that $|y| \le 2^m$. Thus, the yield of the full parse tree, $w$, can be written as
$w = xyz$ for strings $x$, $y$, and $z$ in $\mathcal{T}^*$.

Figure 4.31 $L(G)$ is generated by a grammar $G$ in Chomsky normal form with $m$
non-terminals. (a) Each $w \in L(G)$ with $|w| \ge 2^{m-1} + 1$ has a parse tree with a longest path $P$
containing at least $m + 1$ non-terminals. (b) $SP$, the portion of $P$ containing the last $m + 1$
non-terminals on $P$, has a non-terminal $A$ that is repeated. The derivation $A \Rightarrow^* sAu$ can be
deleted or repeated to generate new strings in $L(G)$.

By the pigeonhole principle stated in Section 4.5, some non-terminal is repeated on $SP$.
Let $A$ be such a non-terminal. Consider the first and second time that $A$ appears on $SP$.
(See Fig. 4.31(b).) Repeat all the rules of the grammar $G$ that produced the string $y$ except
for the rule corresponding to the first instance of $A$ on $SP$ and all those rules that depend
on it. It follows that $D \Rightarrow^* aAb$ where $a$ and $b$ are in $\mathcal{T}^*$. Similarly, apply all the rules to
the derivation beginning with the first instance of $A$ on $P$ up to but not including the rules
beginning with the second instance of $A$. It follows that $A \Rightarrow^* sAu$, where $s$ and $u$ are in $\mathcal{T}^*$
and at least one is not $\epsilon$ since no rules of the form $A \to B$ are in $G$. Finally, apply the rules
starting with the second instance of $A$ on $P$. Let $A \Rightarrow^* t$ be the yield of this set of rules. Since
$A \Rightarrow^* sAu$ and $A \Rightarrow^* t$, it follows that $L$ also contains $xatbz$. $L$ also contains $x a s^n t u^n b z$
for $n \ge 1$ because $A \Rightarrow^* sAu$ can be applied $n$ times after $A \Rightarrow^* sAu$ and before $A \Rightarrow^* t$. Now
let $r = xa$ and $v = bz$. ■

We use this lemma to show the existence of a language that is not context-free. 

LEMMA 4.13.2 The language $L = \{a^n b^n c^n \mid n \ge 0\}$ over the alphabet $\Sigma = \{a, b, c\}$ is not
context-free.

Proof We assume that $L$ is context-free generated by a grammar with $m$ non-terminals and
show this implies $L$ contains strings not in the language. Let $n_0 = 2^{m-1} + 1$.

Since $L$ is infinite, the pumping lemma can be applied. Let $rstuv = a^n b^n c^n$ for $n =
n_0$. From the pumping lemma $r s^2 t u^2 v$ is also in $L$. Clearly if $s$ or $u$ is not empty (and at
least one is), then they contain either one, two, or three of the symbols in $\Sigma$. If one of them,
say $s$, contains two symbols, then $s^2$ contains a $b$ before an $a$ or a $c$ before a $b$, contradicting
the definition of the language. The same is true if one of them contains three symbols.
Thus, they contain exactly one symbol. But this implies that the number of $a$'s, $b$'s, and $c$'s
in $r s^2 t u^2 v$ is not the same, whether or not $s$ and $u$ contain the same or different symbols. ■
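The case analysis in this proof can be checked by brute force on a small instance. The Python sketch below is illustrative only: the function names are assumptions, and a finite check of one string is a sanity test of the argument, not a substitute for the proof. It enumerates every split $w = rstuv$ with $|su| \ge 1$ and $|stu|$ at most a given bound and reports the splits for which $r s^2 t u^2 v$ is still in $L$.

```python
import re


def in_L(w):
    """Membership in L = { a^n b^n c^n : n >= 0 }."""
    m = re.fullmatch(r"(a*)(b*)(c*)", w)
    return m is not None and len(m.group(1)) == len(m.group(2)) == len(m.group(3))


def pumpable_splits(w, bound):
    """Yield splits (r, s, t, u, v) of w with |su| >= 1 and |stu| <= bound
    for which r s^2 t u^2 v is still in L."""
    n = len(w)
    for i in range(n + 1):                              # r = w[:i]
        for end in range(i, min(n, i + bound) + 1):     # stu = w[i:end]
            for j in range(i, end + 1):                 # s = w[i:j]
                for k in range(j, end + 1):             # t = w[j:k], u = w[k:end]
                    r, s, t, u, v = w[:i], w[i:j], w[j:k], w[k:end], w[end:]
                    if s + u and in_L(r + s * 2 + t + u * 2 + v):
                        yield (r, s, t, u, v)


if __name__ == "__main__":
    w = "aaabbbccc"
    print(list(pumpable_splits(w, len(w))))   # no split survives pumping
```

Even with the bound on $|stu|$ relaxed to the full length of the string, no split of $a^3 b^3 c^3$ survives pumping, as the proof predicts: pumping two substrings can raise the counts of at most two of the three letters.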

4.13.2 CFL Closure Properties 

In Section 4.6 we examined the closure properties of regular languages. We demonstrated that 
they are closed under concatenation, union, Kleene closure, complementation, and intersec- 
tion. In this section we show that the context-free languages are closed under concatenation, 
union, and Kleene closure but not complementation or intersection. A class of languages is 
closed under an operation if the result of performing the operation on one or more languages 
in the class produces another language in the class. 

The concatenation, union, and Kleene closure of languages are defined in Section 4.3. The
concatenation of languages $L_1$ and $L_2$, denoted $L_1 \cdot L_2$, is the language $\{uv \mid u \in L_1 \text{ and } v \in
L_2\}$. The union of languages $L_1$ and $L_2$, denoted $L_1 \cup L_2$, is the set of strings that are in $L_1$
or $L_2$ or both. The Kleene closure of a language $L$, denoted $L^*$ and called the Kleene star, is
the language $\bigcup_{i=0}^{\infty} L^i$ where $L^0 = \{\epsilon\}$ and $L^i = L \cdot L^{i-1}$.

THEOREM 4.13.1 The context-free languages are closed under concatenation, union, and Kleene
closure.
Proof Consider two arbitrary CFLs $L(H_1)$ and $L(H_2)$ generated by grammars $H_1 =
(\mathcal{N}_1, \mathcal{T}_1, \mathcal{R}_1, S_1)$ and $H_2 = (\mathcal{N}_2, \mathcal{T}_2, \mathcal{R}_2, S_2)$. Without loss of generality assume that their
non-terminal alphabets (and rules) are disjoint. (If not, prefix every non-terminal in the
second grammar with a symbol not used in the first. This does not change the language
generated.)

Since each string in $L(H_1) \cdot L(H_2)$ consists of a string of $L(H_1)$ followed by a string
of $L(H_2)$, it is generated by the context-free grammar $H_3 = (\mathcal{N}_3, \mathcal{T}_3, \mathcal{R}_3, S_3)$ in which
$\mathcal{N}_3 = \mathcal{N}_1 \cup \mathcal{N}_2 \cup \{S_3\}$, $\mathcal{T}_3 = \mathcal{T}_1 \cup \mathcal{T}_2$, and $\mathcal{R}_3 = \mathcal{R}_1 \cup \mathcal{R}_2 \cup \{S_3 \to S_1 S_2\}$. The new rule
$S_3 \to S_1 S_2$ generates a string of $L(H_1)$ followed by a string of $L(H_2)$. Thus, $L(H_1) \cdot L(H_2)$
is context-free.

The union of languages $L(H_1)$ and $L(H_2)$ is generated by the context-free grammar
$H_4 = (\mathcal{N}_4, \mathcal{T}_4, \mathcal{R}_4, S_4)$ in which $\mathcal{N}_4 = \mathcal{N}_1 \cup \mathcal{N}_2 \cup \{S_4\}$, $\mathcal{T}_4 = \mathcal{T}_1 \cup \mathcal{T}_2$, and $\mathcal{R}_4 = \mathcal{R}_1 \cup
\mathcal{R}_2 \cup \{S_4 \to S_1, S_4 \to S_2\}$. To see this, observe that after applying $S_4 \to S_1$ all subsequent
rules are drawn from $H_1$. (The sets of non-terminals are disjoint.) A similar statement
applies to the application of $S_4 \to S_2$. Since $\mathcal{R}_4$ is context-free, $L(H_4) = L(H_1) \cup L(H_2)$
is context-free.

The Kleene closure of $L(H_1)$, namely $L(H_1)^*$, is generated by the context-free grammar
$H_5 = (\mathcal{N}_1, \mathcal{T}_1, \mathcal{R}_5, S_1)$ in which $\mathcal{R}_5 = \mathcal{R}_1 \cup \{S_1 \to \epsilon, S_1 \to S_1 S_1\}$. To see this, observe
that $L(H_5)$ includes $\epsilon$, every string in $L(H_1)$, and, through $i - 1$ applications of $S_1 \to S_1 S_1$,
every string in $L(H_1)^i$. Thus, $L(H_1)^*$ is generated by $H_5$ and is context-free. ■
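The three constructions can be tried out on small grammars with a short Python sketch. Everything named here is an assumption of the sketch: grammars are dictionaries from a non-terminal to a list of right-hand sides (tuples of symbols), the new start symbols are called `S3`, `S4`, and `S5` and are assumed not to occur in the input grammars, and `language` enumerates only the strings up to a given length. For the Kleene closure the sketch introduces a fresh start symbol `S5` with rules `S5 -> epsilon` and `S5 -> S5 S1` instead of adding rules for $S_1$; this is a deliberate variation that stays safe when $S_1$ also occurs on a right-hand side of $\mathcal{R}_1$, as it does in the example grammar below.

```python
def language(rules, start, max_len):
    """Strings of length <= max_len generated from `start`.

    Any symbol that is not a key of `rules` is treated as a terminal.
    The sets grow monotonically and are finite, so the loop terminates.
    """
    lang = {a: set() for a in rules}
    changed = True
    while changed:                        # iterate to a fixed point
        changed = False
        for head, bodies in rules.items():
            for body in bodies:
                partial = {""}
                for sym in body:
                    choices = lang[sym] if sym in rules else {sym}
                    partial = {x + y for x in partial for y in choices
                               if len(x + y) <= max_len}
                if not partial <= lang[head]:
                    lang[head] |= partial
                    changed = True
    return lang[start]


def concat(r1, s1, r2, s2):
    """Rules of H1 and H2 plus S3 -> S1 S2."""
    return {**r1, **r2, "S3": [(s1, s2)]}, "S3"


def union(r1, s1, r2, s2):
    """Rules of H1 and H2 plus S4 -> S1 and S4 -> S2."""
    return {**r1, **r2, "S4": [(s1,), (s2,)]}, "S4"


def star(r1, s1):
    """Rules of H1 plus S5 -> epsilon and S5 -> S5 S1 (fresh start symbol)."""
    return {**r1, "S5": [(), ("S5", s1)]}, "S5"


# Two small example grammars with disjoint non-terminals (hypothetical).
H1 = {"S1": [("a", "S1", "b"), ()]}       # { a^n b^n : n >= 0 }
H2 = {"S2": [("c", "S2"), ("c",)]}        # { c^n : n >= 1 }

if __name__ == "__main__":
    print(sorted(language(*concat(H1, "S1", H2, "S2"), 4)))
    print(sorted(language(*union(H1, "S1", H2, "S2"), 3)))
    print(sorted(language(*star(H1, "S1"), 4)))
```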

We now use this result and Lemma 4.13.2 to show that the set of context-free languages
is not closed under complementation and intersection, operations defined in Section 4.6. The
complement of a language $L$ over an alphabet $\Sigma$, denoted $\overline{L}$, is the set of strings in $\Sigma^*$ that are
not in $L$. The intersection of two languages $L_1$ and $L_2$, denoted $L_1 \cap L_2$, is the set of strings
that are in both languages.

THEOREM 4.13.2 The set of context-free languages is not closed under complementation or
intersection.

Proof The intersection of two languages $L_1$ and $L_2$ can be defined in terms of the
complement and union operations as follows:

$$L_1 \cap L_2 = \Sigma^* - \left( (\Sigma^* - L_1) \cup (\Sigma^* - L_2) \right)$$

Thus, since the union of two CFLs is a CFL, if the complement of a CFL is also a CFL, from
this identity, the intersection of two CFLs is also a CFL. We now show that the intersection
of two CFLs is not always a CFL.

The language $L_1 = \{a^n b^n c^m \mid n, m \ge 0\}$ is generated by the grammar $H_1 = (\mathcal{N}_1, \mathcal{T}_1,
\mathcal{R}_1, S_1)$, where $\mathcal{N}_1 = \{S, A, B\}$, $\mathcal{T}_1 = \{a, b, c\}$, and the rules $\mathcal{R}_1$ are:

a) $S \to AB$        d) $B \to Bc$

b) $A \to aAb$       e) $B \to \epsilon$

c) $A \to \epsilon$

The language $L_2 = \{a^m b^n c^n \mid n, m \ge 0\}$ is generated by the grammar $H_2 = (\mathcal{N}_2, \mathcal{T}_2,
\mathcal{R}_2, S_2)$, where $\mathcal{N}_2 = \{S, A, B\}$, $\mathcal{T}_2 = \{a, b, c\}$ and the rules $\mathcal{R}_2$ are:

a) $S \to AB$        d) $B \to bBc$

b) $A \to aA$        e) $B \to \epsilon$

c) $A \to \epsilon$


Thus, the languages $L_1$ and $L_2$ are context-free. However, their intersection is $L_1 \cap L_2 =
\{a^n b^n c^n \mid n \ge 0\}$, which was shown in Lemma 4.13.2 not to be context-free. Thus, the set
of CFLs is not closed under intersection, nor is it closed under complementation. ■
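Both facts used in this proof, the identity expressing intersection through complement and union and the equality $L_1 \cap L_2 = \{a^n b^n c^n \mid n \ge 0\}$, can be checked on a finite universe of strings. The Python sketch below does this for all strings over $\{a, b, c\}$ of length at most 6. The function names and the choice of length bound are assumptions, and a finite check is only a sanity test.

```python
import re
from itertools import product


def strings_up_to(alphabet, max_len):
    """All strings over `alphabet` of length 0 .. max_len."""
    for n in range(max_len + 1):
        for tup in product(alphabet, repeat=n):
            yield "".join(tup)


def block_lengths(w):
    """Return (#a, #b, #c) if w has the shape a*b*c*, otherwise None."""
    m = re.fullmatch(r"(a*)(b*)(c*)", w)
    return None if m is None else tuple(len(g) for g in m.groups())


def in_L1(w):                       # a^n b^n c^m
    k = block_lengths(w)
    return k is not None and k[0] == k[1]


def in_L2(w):                       # a^m b^n c^n
    k = block_lengths(w)
    return k is not None and k[1] == k[2]


UNIVERSE = set(strings_up_to("abc", 6))    # finite stand-in for Sigma*
L1 = {w for w in UNIVERSE if in_L1(w)}
L2 = {w for w in UNIVERSE if in_L2(w)}

# The identity from the proof, restricted to the finite universe.
DE_MORGAN = UNIVERSE - ((UNIVERSE - L1) | (UNIVERSE - L2))

if __name__ == "__main__":
    print(sorted(L1 & L2, key=len))
    print(DE_MORGAN == (L1 & L2))
```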



Problems 

FSM MODELS 

4.1 Let $M = (\Sigma, \Psi, Q, \delta, \lambda, s, F)$ be the FSM model described in Definition 3.1.1. It
differs from the FSM model of Section 4.1 in that its output alphabet $\Psi$ has been
explicitly identified. Let this machine recognize the language $L(M)$ consisting of input
strings $w$ that cause the last output produced by $M$ to be the first letter in $\Psi$. Show
that every language recognized under this definition is a language recognized according
to the "final-state definition" in Definition 4.1.1 and vice versa.

4.2 The Mealy machine is a seven-tuple $M = (\Sigma, \Psi, Q, \delta, \lambda, s, F)$ identical in its
definition with the Moore machine of Definition 3.1.1 except that its output function
$\lambda : Q \times \Sigma \mapsto \Psi$ depends on both the current state and input letter, whereas the output
function $\lambda : Q \mapsto \Psi$ of the Moore FSM depends only on the current state. Show that
the two machines recognize the same languages and compute the same functions with
the exception of $\epsilon$.

4.3 Suppose that an FSM is allowed to make state $\epsilon$-transitions, that is, state transitions
on the empty string. Show that the new machine model is no more powerful than the
Moore machine model.

Hint: Show how $\epsilon$-transitions can be removed, perhaps by making the resultant FSM
nondeterministic.

EQUIVALENCE OF DFSMS AND NFSMS 

4.4 Functions computed by FSMs are described in Definition 3.1.1. Can a consistent 
definition of function computation by NFSMs be given? If not, why not? 

4.5 Construct a deterministic FSM equivalent to the nondeterministic FSM shown in 
Fig. 4.32. 

REGULAR EXPRESSIONS 

4.6 Show that the regular expression $0(0^*10^*)^+$ defines strings starting with 0 and
containing at least one 1.

4.7 Show that the regular expressions $0^*$, $0(0^*10^*)^+$, and $1(0 + 1)^*$ partition the set of all
strings over 0 and 1.

4.8 Give regular expressions generating the following languages over $\Sigma = \{0, 1\}$:

Figure 4.32 A nondeterministic FSM.



a) L = {w | w has length at least 3 and its third symbol is a 0} 

b) L = {w | w begins with a 1 and ends with a 0} 

c) L = {w | w contains at least three 1s}

4.9 Give regular expressions generating the following languages over $\Sigma = \{0, 1\}$:

a) L = {w | w is any string except 11 and 111}

b) L = {w | every odd position of w is a 1} 

4.10 Give regular expressions for the languages over the alphabet {0, 1, 2, 3, 4, 5, 6, 7, 8, 
9} describing positive integers that are: 

a) even 

b) odd 

c) a multiple of 5 

d) a multiple of 4 

4.11 Give proofs for the rules stated in Theorem 4.3.1.

4.12 Show that $\epsilon + 01 + (010)(10 + 010)^*(\epsilon + 1 + 01)$ and $(01 + 010)^*$ describe the same
language.

REGULAR EXPRESSIONS AND FSMS 



4.13 a) Find a simple nondeterministic finite-state machine accepting the language
$(01 \cup 001 \cup 010)^*$ over $\Sigma = \{0, 1\}$.
b) Convert the nondeterministic finite-state machine of part (a) to a deterministic
finite-state machine by the method of Section 4.2.

4.14 a) Let $\Sigma = \{0, 1, 2\}$, and let L be the language over $\Sigma$ that contains each string
w ending with some symbol that does not occur anywhere else in w. For example,
011012, 20021, 11120, 0002, 10, and 1 are all strings in L. Construct a
nondeterministic finite-state machine that accepts L.
b) Convert the nondeterministic finite-state machine of part (a) to a deterministic
finite-state machine by the method of Section 4.2.

4.15 Describe an algorithm to convert a regular expression to an NFSM using the proof of 
Theorem 4.4.1. 

4.16 Design DFSMs that recognize the following languages: 

a) a*bca* 

b) (a + c)* (ab + ca)b* 

c) $(a^*b^*(b + c)^*)^*$

4.17 Design an FSM that recognizes decimal strings (over the alphabet {0, 1, 2, 3, 4, 5, 6,
7, 8, 9}) representing the integers whose value is 0 modulo 3.

Hint: Use the fact that $(10)^k = 1 \bmod 3$ (where 10 is "ten") to show that
$(a_k (10)^k + a_{k-1} (10)^{k-1} + \cdots + a_1 (10)^1 + a_0) \bmod 3 = (a_k + a_{k-1} + \cdots + a_1 + a_0) \bmod 3$.

4.18 Use the above FSM design to generate a regular expression describing those integers
whose value is 0 modulo 3.

4.19 Describe an algorithm that constructs an NFSM from a regular expression r and accepts 
a string w if w contains a string denoted by r that begins anywhere in w. 

THE PUMPING LEMMA 

4.20 Show that the following languages are not regular: 

a) $L = \{a^n b a^n \mid n \geq 0\}$

b) $L = \{0^n 1^{2n} 0^n \mid n \geq 1\}$

c) $L = \{a^n b^n c^n \mid n \geq 0\}$

4.21 Strengthen the pumping lemma for regular languages by demonstrating that if L is
a regular language over the alphabet $\Sigma$ recognized by a DFSM with m states and it
contains a string w of length m or more, then any substring z of w (w = uzv) of
length m can be written as z = rst, where $|s| \geq 1$ such that for all integers $n \geq 0$,
$u r s^n t v \in L$. Explain why this pumping lemma is stronger than the one stated in
Lemma 4.5.1.

4.22 Show that the language $L = \{a^i b^j \mid i > j\}$ is not regular.

4.23 Show that the following language is not regular:
a) $\{u^n z v^m z w^{n+m} \mid n, m \geq 1\}$

PROPERTIES OF REGULAR LANGUAGES 

4.24 Use Lemma 4.5.1 and the closure property of regular languages under intersection to 
show that the following languages are not regular: 

a) $\{w w^R \mid w \in \{0, 1\}^*\}$

b) $\{w \bar{w} \mid$ where $\bar{w}$ denotes w in which 0's and 1's are interchanged$\}$

c) {w | w has equal number of 0's and 1's}

4.25 Prove or disprove each of the following statements:

a) Every subset of a regular language is regular

b) Every regular language has a proper subset that is also a regular language

c) If L is regular, then so is $\{xy \mid x \in L \text{ and } y \notin L\}$

d) If L is a regular language, then so is $\{w : w \in L \text{ and } w^R \in L\}$

e) $\{w \mid w = w^R\}$ is regular

STATE MINIMIZATION 

4.26 Find a minimal-state FSM equivalent to that shown in Fig. 4.33. 

4.27 Show that the languages recognized by $M$ and $M_{\equiv}$ are the same, where $\equiv$ is the
equivalence relation on M defined by states that are indistinguishable by input strings of any
length.

4.28 Show that the equivalence relation $R_L$ is right-invariant.

4.29 Show that the equivalence relation $R_M$ is right-invariant.

4.30 Show that the right-invariance equivalence relation (defined in Definition 4.7.2) for the
language $L = \{a^n b^n \mid n \geq 0\}$ has an unbounded number of equivalence classes.

4.31 Show that the DFSM in Fig. 4.20 is the machine $M_L$ associated with the language
$L = (10^*1 + 0)^*$.

PUSHDOWN AUTOMATA 

4.32 Construct a pushdown automaton that accepts the following language: L = {w | w is
a string over the alphabet $\Sigma = \{(, )\}$ of balanced parentheses}.

4.33 Construct a pushdown automaton that accepts the following language: L = {w | w
contains more 1s than 0s}.

Figure 4.33 A four-state finite-state machine.

PHRASE STRUCTURE LANGUAGES

4.34 Give phrase-structure grammars for the following languages: 

a) $\{ww \mid w \in \{a, b\}^*\}$

b) $\{0^{2^i} \mid i \geq 1\}$

4.35 Show that the following language can be described by a phrase-structure grammar:

$\{a^i \mid i \text{ is not prime}\}$

CONTEXT-SENSITIVE LANGUAGES 

4.36 Show that every context-sensitive language can be accepted by a linear bounded au- 
tomaton (LBA), a nondeterministic Turing machine in which the tape head visits a 
number of cells that is a constant multiple of the number of characters in the input 
string w. 

Hint: Consider a construction similar to that used in the proof of Theorem 5.4.2. 
Instead of using a second tape, use a second track on the tape of the TM. 

4.37 Show that every language accepted by a linear bounded automaton can be generated by
a context-sensitive grammar.

Hint: Consider a construction similar to that used in the proof of Theorem 5.4.1 but 
instead of deleting characters at the end of TM configuration, encode the end markers 
[ and ] by enlarging the tape alphabet of the LBA to permit the first and last characters 
to be either marked or unmarked. 

4.38 Show that the grammar $G_1$ in Example 4.9.1 is context-sensitive and generates the
language $L(G_1) = \{a^n b^n c^n \mid n \geq 1\}$.

4.39 Show that the language $\{0^{2^i} \mid i \geq 1\}$ is context-sensitive.

4.40 Show that the context-sensitive languages are closed under union, intersection, and 
concatenation. 

CONTEXT-FREE LANGUAGES 

4.41 Show that the language generated by the context-free grammar $G_3$ of Example 4.9.3 is
$L(G_3) = \{c a^n c a^n c b^m c b^m c \mid n, m \geq 0\}$.

4.42 Construct context-free grammars for each of the following languages:

a) $\{w w^R \mid w \in \{a, b\}^*\}$

b) $\{w \mid w \in \{a, b\}^*, w = w^R\}$

c) L = {w | w has twice as many 0's as 1's}

4.43 Give context-free grammars for each of the following languages:

a) $\{w \in \{a, b\}^* \mid w$ has twice as many a's as b's$\}$

b) $\{a^r b^s \mid r \leq s \leq 2r\}$

REGULAR LANGUAGES

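4.44 Show that the language generated by the regular grammar $G_4$ described in Example 4.9.4
is $L(G_4) = (01)^*0$.

4.45 Show that grammar $G = (\mathcal{N}, \mathcal{T}, \mathcal{R}, S)$, where $\mathcal{N} = \{A, B, S\}$, $\mathcal{T} = \{a, b\}$ and the
rules $\mathcal{R}$ are given below, is regular.

a) $S \rightarrow abA$    d) $S \rightarrow \epsilon$    f) $B \rightarrow aS$
b) $S \rightarrow baB$    e) $A \rightarrow bS$    g) $A \rightarrow b$
c) $S \rightarrow B$

Give a derivation for the string abbbaa.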

4.46 Provide a regular grammar generating strings over {0, 1} not containing 00. 

4.47 Give a regular grammar for each of the following languages and show that there is an
FSM that accepts it. In all cases $\Sigma = \{0, 1\}$.

a) L = {w | the length of w is odd}

b) L = {w | w contains at least three 1s}

REGULAR LANGUAGE RECOGNITION 

4.48 Construct a finite-state machine that recognizes the language generated by the grammar
$G = (\mathcal{N}, \mathcal{T}, \mathcal{R}, S)$, where $\mathcal{N} = \{S, X, Y\}$, $\mathcal{T} = \{x, y\}$, and $\mathcal{R}$ contains the following
rules: $S \rightarrow xX$, $S \rightarrow yY$, $X \rightarrow yY$, $Y \rightarrow xX$, $X \rightarrow \epsilon$, and $Y \rightarrow \epsilon$.

4.49 Describe finite-state machines that recognize the following languages:

a) $\{w \in \{a, b\}^* \mid w$ has an odd number of a's$\}$

b) $\{w \in \{a, b\}^* \mid w$ has ab and ba as substrings$\}$

4.50 Show that, if L is a regular language, then the language obtained by reversing the letters 
in each string in L is also regular. 

4.51 Show that, if L is a regular language, then the language consisting of strings in L whose 
reversals are also in L is regular. 

PARSING CONTEXT-FREE LANGUAGES 

4.52 Use the algorithm of Theorem 4.11.2 to construct a parse tree for the string
(a * b + a) * (a + b) generated by the grammar $G_5$ of Example 4.11.2, and give a leftmost and
a rightmost derivation for the string.

4.53 Let $G = (\mathcal{N}, \mathcal{T}, \mathcal{R}, S)$ be the context-free grammar with $\mathcal{N} = \{S\}$ and $\mathcal{T} = \{(, ), 0\}$
with rules $\mathcal{R} = \{S \rightarrow 0, S \rightarrow SS, S \rightarrow (S)\}$. Use the algorithm of Theorem 4.11.2 to
generate a parse tree for the string (0)((0)).

CFL ACCEPTANCE WITH PUSHDOWN AUTOMATA 

4.54 Construct PDAs that accept each of the following languages: 

a) $\{a^n b^n \mid n \geq 0\}$

b) $\{w w^R \mid w \in \{a, b\}^*\}$

c) $\{w \mid w \in \{a, b\}^*, w = w^R\}$

4.55 Construct PDAs that accept each of the following languages:

a) $\{w \in \{a, b\}^* \mid w$ has twice as many a's as b's$\}$

b) $\{a^r b^s \mid r \leq s \leq 2r\}$

4.56 Use the algorithm of Theorem 4.12.2 to construct a context-free grammar that generates
the language accepted by the PDA in Example 4.8.2.

4.57 Construct a context-free grammar for the language $\{w c w^R \mid w \in \{a, b\}^*\}$.

Hint: Use the algorithm of Theorem 4.12.2 to construct a context-free grammar that
generates the language accepted by the PDA in Example 4.8.1.

PROPERTIES OF CONTEXT-FREE LANGUAGES 

4.58 Show that the intersection of a context-free language and a regular language is context- 
free. 

Hint: From machines accepting the two language types, construct a machine accepting 
their intersection. 

4.59 Suppose that L is a context-free language and R is a regular one. Is $L - R$ necessarily
context-free? What about $R - L$? Justify your answers.

4.60 Show that, if L is context-free, then so is $L^R = \{w^R \mid w \in L\}$.

4.61 Let $G = (\mathcal{N}, \mathcal{T}, \mathcal{R}, S)$ be context-free. A non-terminal A is self-embedding if and
only if $A \stackrel{*}{\Rightarrow}_G sAu$ for some $s, u \in \mathcal{T}^*$.

a) Give a procedure to determine whether $A \in \mathcal{N}$ is self-embedding.

b) Show that, if G does not have a self-embedding non-terminal, then it is regular.

CFL PUMPING LEMMA 

4.62 Show that the following languages are not context-free: 

a) $\{0^{2^i} \mid i \geq 1\}$

b) $\{b^{n^2} \mid n \geq 1\}$

c) $\{0^n \mid n \text{ is a prime}\}$

4.63 Show that the following languages are not context-free:

a) $\{0^m 1^n 0^m 1^n \mid m, n \geq 0\}$

b) $\{a^i b^j c^k \mid 0 \leq i \leq j \leq k\}$

c) $\{ww \mid w \in \{0, 1\}^*\}$

4.64 Show that the language $\{ww \mid w \in \{a, b\}^*\}$ is not context-free.

CFL CLOSURE PROPERTIES 

4.65 Let $M_1$ and $M_2$ be pushdown automata accepting the languages $L(M_1)$ and $L(M_2)$.
Describe PDAs accepting their union $L(M_1) \cup L(M_2)$, concatenation $L(M_1) \cdot L(M_2)$,
and Kleene closure $L(M_1)^*$, thereby giving an alternate proof of Theorem 4.13.1.

4.66 Use closure under concatenation of context-free languages to show that the language
$\{w w^R v^R v \mid w, v \in \{a, b\}^*\}$ is context-free.

Chapter Notes

The concept of the finite-state machine is often attributed to McCulloch and Pitts [211].
The models studied today are due to Moore [223] and Mealy [215]. The equivalence of 
deterministic and non-deterministic FSMs (Theorem 4.4.1) was established by Rabin and 
Scott [266]. 

Kleene established the equivalence of regular expressions and finite-state machines. The 
proof used in Theorems 4.4.1 and 4.4.2 is due to McNaughton and Yamada [212]. The 
pumping lemma (Lemma 4.5.1) is due to Bar-Hillel, Perles, and Shamir [28]. The closure
properties of regular expressions are due to McNaughton and Yamada [212].

State minimization was studied by Huffman [144] and Moore [223]. The Myhill-Nerode 
Theorem was independently obtained by Myhill [227] and Nerode [229]. Hopcroft [139] has 
given an efficient algorithm for state minimization. 

Chomsky [68,69] defined four classes of formal language, the regular, context-free, context- 
sensitive, and phrase-structure languages. He and Miller [71] demonstrated the equivalence 
of languages generated by regular grammars and those recognized by finite-state machines. 
Chomsky introduced the normal form that carries his name [69]. Oettinger [233] introduced
the pushdown automaton and Schützenberger [305], Chomsky [70], and Evey [97] independently
demonstrated the equivalence of context-free languages and pushdown automata.

Two efficient algorithms for parsing context-free languages were developed by Earley [94] 
and Cocke (unpublished) and independently by Kasami [162] and Younger [371]. These are 
cubic-time algorithms. Our formulation of the parsing algorithm of Section 4.11 is based 
on Valiant's derivation [342] of the Cocke-Kasami-Younger recognition matrix, where he also 
presents the fastest known general algorithm to parse context-free languages. The CFL pumping
lemma and the closure properties of CFLs are due to Bar-Hillel, Perles, and Shamir [28].

Myhill [228] introduced the deterministic linear-bounded automata and Landweber [189] 
showed that languages accepted by linear-bounded automata are context-sensitive. Kuroda 
[184] generalized the linear-bounded automata to be nondeterministic and established the 
equivalence of such machines and the context-sensitive languages. 



CHAPTER 5

Computability



The Turing machine (TM) is believed to be the most general computational model that can 
be devised (the Church-Turing thesis). Despite many attempts, no computational model has 
yet been introduced that can perform computations impossible on a Turing machine. This 
is not a statement about efficiency; other machines, notably the RAM of Section 3.4, can do 
the same computations either more quickly or with less memory. Instead, it is a statement 
about the feasibility of computational tasks. If a task can be done on a Turing machine, it is 
considered feasible; if it cannot, it is considered infeasible. Thus, the TM is a litmus test for 
computational feasibility. As we show later, however, there are some well-defined tasks that 
cannot be done on a TM. 

The chapter opens with a formal definition of the standard Turing machine and describes 
how the Turing machine can be used to compute functions and accept languages. We then 
examine multi-tape and nondeterministic TMs and show their equivalence to the standard 
model. The nondeterministic TM plays an important role in Chapter 8 in the classification of 
languages by their complexity. The equivalence of phrase-structure languages and the languages 
accepted by TMs is then established. The universal Turing machine is defined and used to 
explore limits on language acceptance by Turing machines. We show that some languages 
cannot be accepted by any Turing machine, while others can be accepted but not by Turing 
machines that halt on all inputs (the languages are unsolvable). This sets the stage for a proof 
that some problems, such as the Halting Problem, are unsolvable; that is, there is no Turing 
machine halting on all inputs that can decide for an arbitrary Turing machine M and input 
string w whether or not M will halt on w. We close by defining the partial recursive functions, 
the most general functions computable by Turing machines.

5.1 The Standard Turing Machine Model

The standard Turing machine consists of a control unit, which is a finite-state machine, and 
a (single-ended) infinite-capacity tape unit. (See Fig. 5.1.) Each cell of the tape unit initially
contains the blank symbol $\beta$. A string of symbols from the tape alphabet $\Gamma$ is written
left-adjusted on the tape and the tape head is placed over the first cell. The control unit then reads
the symbol under the head and makes a state transition the result of which is either to write 
a new symbol under the tape head or to move the head left (if possible) or right. (The TM 
described in Section 3.7 is slightly different; it always replaces the cell contents and always 
issues a move command, even if the effect in both cases is null. The equivalence between the 
standard TM and that described in Section 3.7 is easily established. See Problem 5.1.) A move 
left from the first cell leads to abnormal termination, a problem that can be avoided by having 
the Turing machine write a special end-of-tape marker in the first tape cell. This marker is a 
tape symbol not used elsewhere. 

DEFINITION 5.1.1 A standard Turing machine (TM) is a six-tuple $M = (\Gamma, \beta, Q, \delta, s, h)$
where $\Gamma$ is the tape alphabet not containing the blank symbol $\beta$, $Q$ is the finite set of states,
$\delta : Q \times (\Gamma \cup \{\beta\}) \mapsto (Q \cup \{h\}) \times (\Gamma \cup \{\beta\} \cup \{L, R\})$ is the next-state function, s is the
initial state, and $h \notin Q$ is the accepting halt state. A TM cannot exit from h. If M is in state
q with letter a under the tape head and $\delta(q, a) = (q', C)$, its control unit enters state q' and writes
a' if $C = a' \in \Gamma \cup \{\beta\}$ or moves the head left (if possible) or right if C is L or R, respectively.

The TM M accepts the input string $w \in \Gamma^*$ (it contains no blanks) if when started in state
s with w placed left-adjusted on its otherwise blank tape and the tape head at the leftmost tape cell,
the last state entered by M is h. M accepts the language L(M) consisting of all strings accepted
by M. Languages accepted by Turing machines are called recursively enumerable. A language
L is decidable or recursive if there exists a TM M that halts on every input string, whether in L
or not, and accepts exactly the strings in L.

A function $f : \Gamma^* \mapsto \Gamma^* \cup \{\bot\}$, where $\bot$ is a symbol that is not in $\Gamma$, is partial if for some
$w \in \Gamma^*$, $f(w) = \bot$ (f is not defined on w). Otherwise, f is total.

A TM M computes a function $f : \Gamma^* \mapsto \Gamma^* \cup \{\bot\}$ for those w such that f(w) is defined if
when started in state s with w placed left-adjusted on its otherwise blank tape and the tape head
at the leftmost tape cell, M enters the accepting halt state h with f(w) written left-adjusted on its
otherwise blank tape. If a TM halts on all inputs, it implements an algorithm. A task defined by
a total function f is solvable if f has an algorithm and unsolvable otherwise.

Figure 5.1 The control and tape units of the standard Turing machine.

Figure 5.2 An accepter (a) for a language L is a Turing machine that can accept strings in a
language L but may not halt on all inputs. A decider or recognizer (b) for a language L is a Turing
machine that halts on all inputs and accepts strings in L.



The accepting halt state h has been singled out to emphasize language acceptance. How- 
ever, there is nothing to prevent a TM from having multiple halt states, states from which it 
does not exit. (A halt state can be realized by a state to which a TM returns on every input 
without moving the tape head or changing the value under the head.) On the other hand, on 
some inputs a TM may never halt. For example, it may endlessly move its tape head right one 
cell and write the symbol a. 

Notice that we do not require a TM M to halt on every input string for it to accept a
language L(M). It need only halt on those strings in the language. A language L for which
there is a TM M accepting L = L(M) that halts on all inputs is decidable. The distinction 
between accepting and recognizing (or deciding) a language L is illustrated schematically in 
Fig. 5.2. An accepter is a TM that accepts strings in L but may not halt on strings not in L. 
When the accepter determines that the string w is in the language L, it turns on the "Yes" 
light. If this light is not turned on, it may be that the string is not in L or that the TM is just 
slow. On the other hand, a recognizer or decider is a TM that halts on all inputs and accepts 
strings in L. The "Yes" or "No" light is guaranteed to be turned on at some time. 

The computing power of the TM is extended by allowing partial computations, com- 
putations on which the TM does not halt on every input. The computation of functions by 
Turing machines is discussed in Section 5.9. 

5.1.1 Programming the Turing Machine 

Programming a Turing machine means choosing a tape alphabet and designing its control 
unit, a finite-state machine. Since the FSM has been extensively studied elsewhere, we limit 
our discussion of programming of Turing machines to four examples, each of which illustrates 
a fundamental point about Turing machines. Although TMs are generally designed to perform 
unbounded computations, their control units have a bounded number of states. Thus, we must 
insure that as they move across their tapes they do not accumulate an unbounded amount of 
information. 

A simple example of a TM is one that moves right until it encounters a blank, whereupon
it halts. The TM of Fig. 5.3(a) performs this task. If the symbol under the head is 0 or 1,
it moves right. If it is the blank symbol, it halts. This TM can be extended to replace the
rightmost character in a string of non-blank characters with a blank. After finding the blank
on the right of a non-blank string, it backs up one cell and replaces the character with a blank.
Both TMs compute functions that map strings to strings.

Figure 5.3 The transition functions of two Turing machines, one (a) that moves across the
non-blank symbols on its tape and halts over the first blank symbol, and a second (b) that moves
the input string right one position and inserts a blank to its left.
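As an illustration, the following Python sketch simulates a standard TM in the sense of Definition 5.1.1 and runs a machine that moves right across 0s and 1s and halts on the first blank. The function name run_tm, the dictionary encoding of the next-state function, the state names q1 and h, and the step limit are assumptions made for this sketch; the exact transition table of Fig. 5.3(a) is not reproduced here.

```python
BLANK = "β"  # stand-in for the blank symbol

def run_tm(delta, start, halt, w, max_steps=10_000):
    """Run a standard TM on input w; return (state, tape, head position).

    delta maps (state, symbol) -> (next_state, C), where C is either a
    symbol to write or one of the head commands "L" / "R".
    Assumption: "L" and "R" are not themselves tape symbols.
    """
    tape = list(w) or [BLANK]
    state, head = start, 0
    for _ in range(max_steps):
        if state == halt or (state, tape[head]) not in delta:
            return state, tape, head      # stopped; accepting only if state == halt
        state, c = delta[(state, tape[head])]
        if c == "L":
            if head == 0:
                raise RuntimeError("abnormal termination: move left from first cell")
            head -= 1
        elif c == "R":
            head += 1
            if head == len(tape):         # extend the single-ended tape on demand
                tape.append(BLANK)
        else:
            tape[head] = c                # write a symbol; the head stays put
    raise RuntimeError("step limit reached; the machine may not halt")

# Hypothetical table: move right over 0 and 1, halt on the first blank.
delta = {
    ("q1", "0"): ("q1", "R"),
    ("q1", "1"): ("q1", "R"),
    ("q1", BLANK): ("h", BLANK),
}

state, tape, head = run_tm(delta, "q1", "h", "0110")
print(state, head, "".join(tape))
```

The step limit exists only because a simulator cannot tell in general whether a TM will halt; it is not part of the machine model.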

A second example is a TM that replaces the first letter in its input string with a blank and
shifts the remaining letters right one position. (See Fig. 5.3(b).) In its initial state $q_1$ this TM,
which is assumed to be given a non-blank input string, records the symbol under the tape head
by entering $q_2$ if the letter is 0 or $q_3$ if the letter is 1 and writing the blank symbol. In its
current state it moves right and enters a corresponding state. (It enters $q_4$ if its current state
is $q_2$ and $q_5$ if it is $q_3$.) In the new state it prints the letter originally in the cell to its left and
enters either $q_2$ or $q_3$ depending on whether the current cell contains 0 or 1. This TM can
be used to insert a special end-of-tape marker instead of a blank to the left of a string written
initially on a tape. This idea can be generalized to insert a symbol anywhere in another string.

A third example of a TM M is one that accepts strings in the language $L = \{a^n b^n c^n \mid n \geq 1\}$.
M inserts an end-of-tape marker to the left of a string w placed on its tape and uses a
computation denoted C(x, y), in which it moves right across zero or more x's followed by
zero or more "pseudo-blanks" (a symbol other than a, b, c, or $\beta$) to an instance of y, entering
a non-accepting halt state f if some other pattern of letters is found. Starting in the first cell,
if M discovers that the next letter is not a, it exits to state f. If it is a, it replaces a by a
pseudo-blank. It then executes C(a, b). M then replaces b by a pseudo-blank and executes
C(b, c), after which it replaces c by a pseudo-blank and executes C(c, $\beta$). It then returns to
the beginning of the tape. If it arrives at the end-of-tape marker without encountering any
instances of a, b, or c, it terminates in the accepting halt state h. If not, then it moves right
over pseudo-blanks until it finds an a, entering state f if it finds some other letter. It then
resumes the process executed on the first pass by invoking C(a, b). This computation either
enters the non-accepting halt state f or on each pass it replaces one instance each of a, b, and
c with a pseudo-blank. Thus, M accepts the language $L = \{a^n b^n c^n \mid n \geq 1\}$; that is, L
is decidable (recursive). Since M makes one pass over the tape for each instance of a, it uses
time $O(n^2)$ on a string of length n. Later we give examples of languages that are recursively
enumerable but not recursive.
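The pass structure of this machine can be mirrored in a few lines of Python. This is a sketch of the strategy only, not a transition table: a list index stands in for the tape head, the end-of-tape marker is omitted because the index makes it unnecessary, and the names decide, scan, and PSEUDO are assumptions introduced here. The input is assumed to be a string over {a, b, c}.

```python
BLANK, PSEUDO = "β", "#"   # blank and pseudo-blank stand-ins

def scan(tape, i, x, y):
    """C(x, y): from cell i move right over x's, then over pseudo-blanks.

    Returns the index reached if that cell holds y, and None otherwise
    (the machine would enter the non-accepting halt state f).
    """
    while tape[i] == x:
        i += 1
    while tape[i] == PSEUDO:
        i += 1
    return i if tape[i] == y else None

def decide(w):
    """Return (accepted, passes) for the language {a^n b^n c^n | n >= 1}."""
    tape = list(w) + [BLANK]          # trailing blank stops every scan
    passes = 0
    while True:
        i = 0
        while tape[i] == PSEUDO:      # skip letters erased on earlier passes
            i += 1
        if tape[i] == BLANK:          # nothing but pseudo-blanks remains
            return passes > 0, passes
        if tape[i] != "a":
            return False, passes
        # Erase one a, one b, and one c, checking the pattern between them.
        for x, y in (("a", "b"), ("b", "c"), ("c", BLANK)):
            tape[i] = PSEUDO
            i = scan(tape, i + 1, x, y)
            if i is None:
                return False, passes
        passes += 1

print(decide("aabbcc"))
print(decide("aabbc"))
```

Each pass touches every cell a bounded number of times and there is one pass per a, which is where the quadratic time bound comes from.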

In Section 3.8 we reasoned that any RAM computation can be simulated by a Turing 
machine. We showed that any program written for the RAM can be executed on a Turing 
machine at the expense of an increase in the running time from T steps on a RAM with S bits
of storage to a time $O(ST \log S)$ on the Turing machine.



5.2 Extensions to the Standard Turing Machine Model 

In this section we examine various extensions to the standard Turing machine model and 
establish their equivalence to the standard model. These extensions include the multi-tape, 
nondeterministic, and oracle Turing machines. 

We first consider the double-ended tape Turing machine. Unlike the standard TM that 
has a tape bounded on one end, this is a TM whose single tape is double-ended. A TM of this 
kind can be simulated by a two-track one-tape TM by reading and writing data on the top 
track when working on cells to the right of the midpoint of the tape and reading and writing 
data on the bottom track when working with cells to its left. (See Problem 5.7.) 

5.2.1 Multi-Tape Turing Machines 

A k-tape Turing machine has a control unit and k single-ended tapes of the kind shown in
Fig. 5.1. Each tape has its own head and operates in the fashion indicated for the standard
model. The FSM control unit accepts inputs from all tapes simultaneously, makes a state
transition based on this data, and then supplies outputs to each tape in the form of either a
letter to be written under its head or a head movement command. We assume that the tape
alphabet of each tape is $\Gamma$. A three-tape TM is shown in Fig. 5.4. A k-tape TM $M_k$ can be
simulated by a one-tape TM $M_1$, as we now show.



THEOREM 5.2.1 For each k-tape Turing machine $M_k$ there is a one-tape Turing machine $M_1$
such that a terminating T-step computation by $M_k$ can be simulated in $O(T^2)$ steps by $M_1$.

Proof Let $\Gamma$ and $\Gamma'$ be the tape alphabets of $M_k$ and $M_1$, respectively. Let $|\Gamma'| = (2|\Gamma|)^k$
so that $\Gamma'$ has enough letters to allow the tape of $M_1$ to be subdivided into k tracks, as
suggested in Fig. 5.5. Each cell of a track holds one of $2|\Gamma|$ letters, a number large enough to
allow each cell to contain either a member of $\Gamma$ or a marked member of $\Gamma$. The marked
members retain their original identity but also contain the information that they have been
marked. As suggested in Fig. 5.5 for a three-tape TM, k heads can be simulated by one head
by marking the positions of the k heads on the tracks of $M_1$.

Figure 5.4 A three-tape Turing machine.

Figure 5.5 A single tape of a TM with a large tape alphabet that simulates a three-tape TM
with a smaller tape alphabet.

$M_1$ simulates $M_k$ in two passes. First it visits marked cells to collect the letters under
the original tape heads, after which it makes a state transition akin to that made by $M_k$. In a
second pass it visits the marked cells either to change their entries or to move the simulated
tape heads. If the k-tape TM executes T steps, it uses at most T + 1 tape cells. Thus each
pass requires $O(T)$ steps and the complete computation can be done in $O(T^2)$ steps. ■
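The track construction can be sketched in Python as follows. Each cell of the single tape is a tuple of k (symbol, marked) pairs, one per track, and one step of the k-tape machine is carried out in two passes over the cells. The names pack and step, the dictionary form of the k-tape next-state function, and the two-tape copying machine used as a demonstration are all assumptions made for illustration. Recording head positions in an array during the first pass is a shortcut of the sketch; a real one-tape TM would find the marks again by rescanning.

```python
BLANK = "β"

def pack(tapes):
    """Pack k tapes into one list of cells, each holding k (symbol, marked) tracks.
    Every simulated head starts on the first cell, so only cell 0 is marked."""
    width = max(1, *(len(t) for t in tapes))
    return [tuple((t[i] if i < len(t) else BLANK, i == 0) for t in tapes)
            for i in range(width)]

def step(cells, state, delta):
    """Simulate one step of the k-tape machine on the packed single tape."""
    k = len(cells[0])
    under, heads = [None] * k, [None] * k
    for i, cell in enumerate(cells):              # pass 1: read the marked cells
        for j, (sym, marked) in enumerate(cell):
            if marked:
                under[j], heads[j] = sym, i
    state, actions = delta[(state, tuple(under))]
    rows = [[list(track) for track in cell] for cell in cells]
    for j, act in enumerate(actions):             # pass 2: write or move the marks
        i = heads[j]
        if act in ("L", "R"):
            new = i + 1 if act == "R" else i - 1
            if new < 0:
                raise RuntimeError("abnormal termination: move left from first cell")
            if new == len(rows):                  # grow every track by one blank cell
                rows.append([[BLANK, False] for _ in range(k)])
            rows[i][j][1], rows[new][j][1] = False, True
        else:
            rows[i][j][0] = act
    return [tuple(tuple(track) for track in row) for row in rows], state

# Hypothetical two-tape machine that copies tape 1 onto tape 2.
delta = {("copy", (BLANK, BLANK)): ("h", (BLANK, BLANK))}
for x in "01":
    delta[("copy", (x, BLANK))] = ("move", (x, x))       # write x on tape 2
    delta[("move", (x, x))] = ("copy", ("R", "R"))       # advance both heads

cells, state, steps = pack(["01", ""]), "copy", 0
while state != "h":
    cells, state = step(cells, state, delta)
    steps += 1
track2 = "".join(cell[1][0] for cell in cells)
print(steps, track2)
```

Since each simulated step scans the whole packed tape twice and the tape holds at most T + 1 cells after T steps, the quadratic overhead of the theorem is visible directly in the loop structure.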

Multi-tape machines in which the tapes are double-ended are equivalent to multi-tape 
single-ended Turing machines, as the reader can show. 

5.2.2 Nondeterministic Turing Machines 

The nondeterministic standard Turing machine (NDTM) is introduced in Section 3.7.1. 
We use a slightly altered definition that conforms to the definition of the standard Turing 
machine in Definition 5.1.1. 

DEFINITION 5.2.1 A nondeterministic Turing machine (NDTM) is a seven-tuple
$M = (\Sigma, \Gamma, \beta, Q, \delta, s, h)$ where $\Sigma$ is the choice input alphabet, $\Gamma$ is the tape alphabet not
containing the blank symbol $\beta$, $Q$ is the finite set of states,
$\delta : Q \times \Sigma \times (\Gamma \cup \{\beta\}) \mapsto ((Q \cup \{h\}) \times (\Gamma \cup \{\beta\} \cup \{L, R\})) \cup \{\bot\}$ is the next-state function,
s is the initial state, and $h \notin Q$ is the accepting halt state. A TM cannot exit from h. If M is
in state q with letter a under the tape head and $\delta(q, c, a) = (q', C)$, its control unit enters state
q' and writes a' if $C = a' \in \Gamma \cup \{\beta\}$, or it moves the head left (if possible) or right if C is L or
R, respectively. If $\delta(q, c, a) = \bot$, there is no successor to the current state with choice input c
and tape symbol a.

Figure 5.6 The construction used to reduce the fan-out of a nondeterministic state.


An NDTM M reads one character of its choice input string $c \in \Sigma^*$ on each step. An NDTM
M accepts string w if there is some choice string c such that the last state entered by M is h when
M is started in state s with w placed left-adjusted on its otherwise blank tape, and the tape head
at the leftmost tape cell. An NDTM M accepts the language $L(M) \subseteq \Gamma^*$ consisting of those
strings w that it accepts. Thus, if $w \notin L(M)$, there is no choice input for which M accepts w.

If an NDTM has more than two nondeterministic choices for a particular state and letter 
under the tape head, we can design another NDTM that has at most two choices. As suggested 
in Fig. 5.6, for each state q that has k possible next states $q_1, \ldots, q_k$ for some input letter, we
can add k — 1 intermediate states, each with two outgoing edges such that a) in each state the 
tape head doesn't move and no change is made in the letter under the head, but b) each state 
has the same k possible successor states. It follows that the new machine computes the same 
function or accepts the same language as the original machine. Consequently, from this point 
on we assume that there are either one or two next states from each state of an NDTM for 
each tape symbol. 

We now show that the range of computations that can be performed by deterministic and 
nondeterministic Turing machines is the same. However, this does not mean that with the 
identical resource bounds they compute the same set of functions. 

THEOREM 5.2.2 Any language accepted by a nondeterministic standard TM can be accepted by a 
standard deterministic one. 

Proof The proof is by simulation. We simulate all possible computations of a nondeter- 
ministic standard TM AInd on an ln P u t string w by a deterministic three-tape TM Mq 
and halt if we find a sequence of moves by Mj^d that leads to an accepting halt state. Later 
this machine can be simulated by a one-tape TM. The three tapes of Mjy are an input 
tape, a work tape, and enumeration tape. (See Fig. 5.7.) The input tape holds the in- 
put and is never modified. The work tape is used to simulate JI/nd- The enumeration 
tape contains choice sequences used by Md to decide which move to make when simu- 
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Figure 5.7 A three-tape deterministic Turing machine that simulates a nondeterministic Turing 
machine. 



lating A/nd- These sequences are generated in lexicographical order, that is, in the order 
0, 1,00, 01, 10, 11,000,001, .... It is straightforward to design a deterministic TM that 
generates these sequences. (See Problem 5.2.) 

Breadth-first search is used. Since a string w is accepted by a nondeterministic TM if
there is some choice input on which it is accepted, a deterministic TM M_D that accepts every
input w accepted by M_ND can be constructed by erasing the work tape, copying the input
sequence w to the work tape, placing the next choice input sequence in lexicographical order
on the enumeration tape (initially this is the sequence 0), and then simulating M_ND on
the work tape while reading one choice input from the enumeration tape on each step. If
M_D runs out of choice inputs before reaching the halt state, the above procedure is restarted
with the next choice input sequence. This method deterministically accepts the input string
w if and only if there is some choice input to M_ND on which it is accepted. ■
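The restart-and-replay procedure of this proof can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the three-tape machine itself: the transition table format (a state and symbol mapped to a list of one or two moves, each a triple of next state, symbol written, and head move), the names `choice_sequences`, `run_with_choices`, `accepts`, and `DELTA`, and the cutoff `max_len` are all hypothetical. A real M_D has no cutoff and simply runs forever on strings that are not accepted.

```python
from itertools import count, product

BLANK = "_"  # stand-in for the blank symbol (an assumption of this sketch)

def choice_sequences():
    """Yield the choice sequences 0, 1, 00, 01, 10, 11, 000, ... in order."""
    for length in count(1):
        for bits in product("01", repeat=length):
            yield "".join(bits)

def run_with_choices(delta, start, accept, w, choices):
    """Replay one computation on w, resolving each step with the next choice bit."""
    tape = dict(enumerate(w))          # fresh work tape holding a copy of w
    state, head = start, 0
    for bit in choices:
        if state == accept:
            return True
        options = delta.get((state, tape.get(head, BLANK)), [])
        if not options:
            return False               # no move is defined, so this branch dies
        state, symbol, move = options[int(bit) % len(options)]
        tape[head] = symbol
        head += {"L": -1, "R": 1, "N": 0}[move]
    return state == accept             # ran out of choice inputs

def accepts(delta, start, accept, w, max_len=12):
    """Try choice sequences in order; give up after max_len (hypothetical cutoff)."""
    for choices in choice_sequences():
        if len(choices) > max_len:
            return False
        if run_with_choices(delta, start, accept, w, choices):
            return True

# Hypothetical machine: guess where two consecutive 1s begin.
DELTA = {
    ("scan", "0"): [("scan", "0", "R")],
    ("scan", "1"): [("scan", "1", "R"), ("pair", "1", "R")],
    ("pair", "1"): [("acc", "1", "N")],
}
```

Each call to `run_with_choices` starts from a clean copy of w, which mirrors erasing the work tape and recopying the input before every new choice sequence.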

Adding more than one tape to a nondeterministic Turing machine does not increase its 
computing power. To see this, it suffices to simulate a multi-tape nondeterministic Turing 
machine with a single-tape one, using a construction parallel to that of Theorem 5.2.1, and 
then invoke the above result. Applying these observations to language acceptance yields the 
following corollary. 

COROLLARY 5.2.1 Any language accepted by a nondeterministic (multi-tape) Turing machine can 
be accepted by a deterministic standard Turing machine. 

We emphasize that this result does not mean that with identical resource bounds the de- 
terministic and nondeterministic Turing machines compute the same set of functions. 

5.2.3 Oracle Turing Machines 

The oracle Turing machine (OTM) is a multi-tape TM or NDTM with a special oracle
tape and an associated oracle function h : B* → B*, which need not be computable. (See
Fig. 5.8.) After writing a string z on its oracle tape, the OTM signals to the oracle to replace
z with the value h(z) of the oracle function. During a computation the OTM may consult
the oracle as many times as it wishes. Time on an OTM is the number of steps taken, where
one consultation of the oracle is counted as one step. Space is the number of cells used on
the work tapes of an OTM, not including the oracle tape. The OTM can be used to
classify problems. (See Problem 8.15.)

Figure 5.8 The oracle Turing machine has an "oracle tape" on which it writes a string (a problem
instance), after which an "oracle" returns an answer in one step.

5.2.4 Representing Restricted Models of Computation 

Now that we have introduced a variety of Turing machine models, we ask how the finite-state 
machine and pushdown automaton fit into the picture. 

The finite-state machine can be viewed as a Turing machine with two tapes, the first a 
read-only input tape and the second a write-only output tape. This TM reads consecutive 
symbols on its input tape, moving right after reading each symbol, and writes outputs on its 
output tape, moving right after writing each symbol. If this TM enters an accepting halt state, 
the input sequence read from the tape is accepted. 

The pushdown automaton can be viewed as a Turing machine with two tapes, a read-only 
input tape and a pushdown tape. The pushdown tape is a standard tape that pushes a new 
symbol by moving its head right one cell and writing the new symbol into this previously 
blank cell. It pops the symbol at the top of the stack by copying the symbol, after which it 
replaces it with the blank symbol and moves its head left one cell. 

The Turing machine can be simulated by two pushdown tapes. The movement of the head 
in one direction can be simulated by popping the top item of one stack and pushing it onto 
the other stack. To simulate the movement of the head in the opposite direction, interchange 
the names of the two stacks. 
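The two-stack simulation can be sketched as follows. This is a minimal illustration, assuming Python lists as the two pushdown stacks, a one-way-infinite tape, and a hypothetical class name `TwoStackTape`; the top of `right` is the cell under the head and the top of `left` is the cell immediately to its left.

```python
class TwoStackTape:
    """A Turing machine tape held in two stacks (a sketch, not a full TM)."""

    def __init__(self, contents, blank="_"):
        self.blank = blank
        self.left = []                                       # cells left of the head
        self.right = list(reversed(contents)) or [blank]     # top = cell under head

    def read(self):
        return self.right[-1]

    def write(self, symbol):
        self.right[-1] = symbol

    def move_right(self):
        # Pop the scanned cell from one stack and push it onto the other.
        self.left.append(self.right.pop())
        if not self.right:
            self.right.append(self.blank)    # extend the tape with a blank

    def move_left(self):
        # Same operation with the roles of the two stacks interchanged.
        if self.left:                        # at the left end the head stays put
            self.right.append(self.left.pop())

    def contents(self):
        return "".join(self.left) + "".join(reversed(self.right))
```

Moving right and moving left are the same pop-then-push step with the stack names swapped, which is the observation made in the text.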

The nondeterministic equivalents of the finite-state machine and pushdown automaton
are obtained by making their Turing machine control units nondeterministic.

We now introduce configuration graphs, graphs that capture the state of Turing machines
with potentially unlimited storage capacity. We begin by describing configuration graphs for
one-tape Turing machines.

DEFINITION 5.3.1 The configuration of a standard Turing machine M at any point in time
is [x_1 x_2 ... p x_j ... x_n], where p is the state of the control unit, the tape head is over the jth tape
cell, and x = (x_1, x_2, ..., x_n) is the string that contains all the non-blank symbols on the tape as
well as the symbol under the head. Here the state p is shown in boldface to the left of the symbol x_j to
indicate that the tape head is over the jth cell. x_n and some of the symbols to its left may be blanks.

To illustrate such configurations, consider a TM M that is in state p reading the third
symbol on its tape, which contains xyz. This information is captured by the configuration
[xypz]. If M changes to state q and moves its head right, then its new configuration is
[xyzqβ]. In this case we add a blank β to the right of the string xyz to ensure that the head
resides over the string.

Because multi-tape TMs are important in classifying problems by their use of temporary
work space, a definition for the configuration of a multi-tape TM is desirable. We now introduce
a notation for this purpose that is somewhat more cumbersome than that used for the standard
TM. This notation uses an explicit binary number for the position of each tape head.

DEFINITION 5.3.2 The configuration of a k-tape Turing machine M is (p, h_1, h_2, ..., h_k,
x_1, x_2, ..., x_k), where h_r is the position of the head in binary on the rth tape, p is the state of
the control unit, and x_r is the string on the rth tape that includes all the non-blank symbols as well
as the symbol under the head.

We now define configuration graphs for deterministic TMs and NDTMs. Because we will 
apply configuration graphs to machines that halt on all inputs, we view them as acyclic. 

DEFINITION 5.3.3 A configuration graph G(M_ND, w) associated with the NDTM M_ND is a
directed graph whose vertices are configurations of M_ND. (See Fig. 5.9.) There is a directed edge
between two vertices if for some choice input vector c M_ND can move from the first configuration to
the second in one step. There is one configuration corresponding to the initial state of the machine
and one corresponding to the final state. (We assume without loss of generality that, after accepting
an input string, M_ND enters a cleanup phase during which it places a fixed string on each tape.)

Figure 5.9 The configuration graph G(M_ND, w) of a nondeterministic Turing machine M_ND
on input w has one vertex for each configuration of M_ND. The graph is acyclic. Heavy edges
identify the nondeterministic choices associated with each configuration.

Configuration graphs are used in the next section to associate a phrase-structure language 
with a Turing machine. They are also used in many places in Chapter 8, especially in Sec- 
tion 8.5.3, where they are used to establish an important relationship between deterministic 
and nondeterministic space classes. 

5.4 Phrase-Structure Languages and Turing Machines 

We now demonstrate that the phrase-structure languages and the languages accepted by Turing 
machines are the same. We begin by showing that every recursively enumerable language 
is a phrase-structure language. For this purpose we use configurations of one-tape Turing 
machines. Then, for each phrase-structure language L we describe the construction of a TM 
accepting L. We conclude that the languages accepted by TMs and described by phrase- 
structure grammars are the same. 

With these conventions as background, if a standard TM halts in its accepting halt state,
we can require that it halt with β1β on its tape when it accepts the input string w. Thus,
the TM configuration when a TM halts and accepts its input string is [hβ1β]. Its starting
configuration is [sβw_1 w_2 ... w_n β], where w = w_1 w_2 ... w_n.

THEOREM 5.4. 1 Every recursively enumerable language is a phrase-structure language. 

Proof Let M = (Γ, β, Q, δ, s, h) be a deterministic TM and let L = L(M) be the recursively
enumerable language over the alphabet Γ that it accepts. The goal is to show the existence of
a phrase-structure grammar G = (N, T, R, S) that can generate each string w of L, and no
others. Since the TM accepting L halts with β1β on its tape when started with w ∈ L, we
design a grammar G that produces the configurations of M in reverse order. Starting with
the final configuration [hβ1β], G produces the starting configuration [sβw_1 w_2 ... w_n β],
where w = w_1 w_2 ... w_n, after which it strips off the characters [sβ at the beginning and
β] at the end. The grammar G defined below serves this purpose, as we show.

Let N = Q ∪ {S, β, [, ]} and T = Γ. The rules R of G are defined as follows:

(a) S → [hβ1β]
(b) β] → ββ]
(c) [sβ → ε
(d) ββ] → β]
(e) β] → ε
(f) xq → px     for all p ∈ Q and x ∈ (Γ ∪ {β}) such that δ(p, x) = (q, R)
(g) qzx → zpx   for all p ∈ Q and x, z ∈ (Γ ∪ {β}) such that δ(p, x) = (q, L)
(h) qy → px     for all p ∈ Q and x ∈ (Γ ∪ {β}) such that δ(p, x) = (q, y), y ∈ (Γ ∪ {β})

These rules are designed to start with the transition S → [hβ1β] (Rule (a)) and then
rewrite [hβ1β] using other rules until the configuration [sβw_1 w_2 ... w_n β] is reached. At
this point Rule (c) is invoked to strip [sβ from the beginning of the string, and Rule (e) strips
β] from the end, thereby producing the string w_1 w_2 ... w_n that was written initially on
M's tape.

Rule (b) is used to add blank space at the right-hand end of the tape. Rules (f)-(h)
mimic the transitions of M in reverse order. Rule (f) says that if M in state p reading x
moves to state q and moves its head right, then M's configuration contained the substring
px before the move and xq after it. Thus, we map xq into px with the rule xq → px.
Similar reasoning is applied to Rule (g). If the transition δ(p, x) = (q, y), y ∈ Γ ∪ {β},
is executed, M's configuration contained the substring px before the step and qy after it
because the head does not move.

Clearly, every computation by a TM M can be described by a sequence of configurations 
and the transitions between these configurations can be described by this grammar G. Thus, 
the strings accepted by M can be generated by G. Conversely, if we are given a derivation 
in G, it produces a series of configurations characterizing computations by the TM M in 
reverse order. Thus, the strings generated by G are the strings accepted by M. ■ 
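To make the reversed-transition rules concrete, the following sketch builds the rules that mimic M's moves backward from a next-state table. It assumes the rule shapes xq → px (right move), qzx → zpx (left move), and qy → px (write) described in the proof, single-character state and symbol names, and a tape alphabet that does not itself contain the letters L or R. The function name `reverse_rules` and the dictionary format are hypothetical.

```python
def reverse_rules(delta, tape_symbols):
    """Return grammar rules (lhs, rhs) that undo one step of the TM.

    delta maps (p, x) to (q, z), where z is "R", "L", or a symbol to write.
    """
    rules = []
    for (p, x), (q, z) in delta.items():
        if z == "R":
            # After a right move the tape shows xq; before it showed px.
            rules.append((x + q, p + x))
        elif z == "L":
            # After a left move the tape shows q, left, x; before: left, p, x.
            for left in tape_symbols:
                rules.append((q + left + x, left + p + x))
        else:
            # The head wrote z in place of x without moving.
            rules.append((q + z, p + x))
    return rules
```

For example, a table containing only δ(s, 0) = (t, R) yields the single rule 0t → s0.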

By showing that every phrase-structure language can be accepted by a Turing machine, we 
will have demonstrated the equivalence between the phrase-structure and recursively enumer- 
able languages. 

THEOREM 5.4.2 Every phrase-structure language is recursively enumerable. 

Proof Given a phrase-structure grammar G, we construct a nondeterministic two-tape TM 
M with the property that L(G) = L(M). Because every language accepted by a multi-tape 
TM is accepted by a one-tape TM and vice versa, we have the desired conclusion. 

To decide whether or not to accept an input string placed on its first (input) tape, M 
nondeterministically generates a terminal string on its second (work) tape using the rules of 
G. To do so, it puts G's start symbol on its work tape and then nondeterministically expands 
it into a terminal string using the rules of G. After producing a terminal string, M compares 
the input string with the string on its work tape. If they agree in every position, M accepts 
the input string. If not, M enters an infinite loop. To write the derived strings on its work 
tape, M must either replace, delete, or insert characters in the string on its tape, tasks well 
suited to Turing machines. 

Since it is possible for M to generate every string in L(G) on its work tape, it can accept 
every string in L(G). On the other hand, every string accepted by M is a string that it can 
generate using the rules of G. Thus, every string accepted by M is in L(G). It follows that 
L(M) = L(G). ■

This last result gives meaning to the phrase "recursively enumerable": the languages ac- 
cepted by Turing machines (the recursively enumerable languages) are languages whose strings 
can be enumerated by a Turing machine (a recursive device). Since an NDTM can be simu- 
lated by a DTM, all strings accepted by a TM can be generated deterministically in sequence. 



5.5 Universal Turing Machines 



A universal Turing machine is a Turing machine that can simulate the behavior of an arbitrary 
Turing machine, even the universal Turing machine itself. To give an explicit construction for 
such a machine, we show how to encode Turing machines as strings.

Without loss of generality we consider only deterministic Turing machines M = (Γ, β, Q,
δ, s, h) that have a binary tape alphabet Γ = B = {0, 1}. When M is in state p and the
value under the head is a, the next-state function δ : Q × (Γ ∪ {β}) → (Q ∪ {h}) ×
(Γ ∪ {β} ∪ {L, R}) takes M to state q and provides output z, where δ(p, a) = (q, z) and
z ∈ Γ ∪ {β} ∪ {L, R}.

We now specify a convention for numbering states that simplifies the description of the
next-state function δ of M.

DEFINITION 5.5.1 The canonical encoding of a Turing machine M, ρ(M), is a string over the
10-letter alphabet A = {<, >, [, ], #, 0, 1, β, R, L} formed as follows:

(a) Let Q = {q_1, q_2, ..., q_k} where s = q_1. Represent state q_i in unary notation by the string
1^i. The halt state h is represented by the empty string.

(b) Let (q, z) be the value of the next-state function when M is in state p reading a under
its tape head; that is, δ(p, a) = (q, z). Represent (q, z) by the string < z#q > in which q is
represented in unary and z ∈ {0, 1, β, L, R}. If q = h, the value of the next-state function is
< z# >.

(c) For p ∈ Q, the three values < z'#q' >, < z''#q'' >, and < z'''#q''' > of δ(p, 0),
δ(p, 1), and δ(p, β) are assembled as a triple [< z'#q' > < z''#q'' > < z'''#q''' >]. The
complete description of the next-state function δ is given as a sequence of such triples, one for each
state p ∈ Q.

To illustrate this definition, consider the two TMs whose next-state functions are shown in
Fig. 5.3. The first moves across the non-blank initial string on its tape and halts over the first
blank symbol. The second moves the input string right one position and inserts a blank to its
left. The canonical encoding of the first TM is [<R#1> <R#1> <β#>] whereas that
of the second is

[<β#11> <β#111> <β#>]
[<R#1111> <R#1111> <R#1111>]
[<R#11111> <R#11111> <R#11111>]
[<0#11> <0#111> <0#>]
[<1#11> <1#111> <1#>]

It follows that the canonical encodings of TMs are a subset of the strings defined by the
regular expression ([(<{0, 1, β, L, R}#1*>)^3])*, which a TM can analyze to ensure that for
each state and tape letter there is a valid action.
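The shape test described by this regular expression is easy to sketch with Python's `re` module. Several points are assumptions of the sketch rather than part of the definition: encodings are written without the spaces used above for readability, the empty string is rejected because a machine needs an initial state, and "a valid action" is read as "every next-state number refers to an existing state or to the halt state". The names `SHAPE`, `ENTRY`, and `is_canonical` are hypothetical.

```python
import re

# One triple: three entries <z#q> inside brackets, with q in unary.
TRIPLE = r"\[(?:<[01βLR]#1*>){3}\]"
SHAPE = re.compile(rf"(?:{TRIPLE})*")
ENTRY = re.compile(r"<([01βLR])#(1*)>")

def is_canonical(encoding):
    """Check the shape of an encoding and that all next states exist."""
    if not encoding or SHAPE.fullmatch(encoding) is None:
        return False
    num_states = encoding.count("[")          # one triple per state
    return all(len(q) <= num_states for _, q in ENTRY.findall(encoding))
```

Applied to the first example machine, `is_canonical("[<R#1><R#1><β#>]")` holds, while a string with only two entries in its triple fails the shape test.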

A universal Turing machine (UTM) U is a Turing machine that is capable of simulating 
an arbitrary Turing machine on an arbitrary input word w. The construction of a UTM based 
on the simulation of the random-access machine is described in Section 3.8. Here we describe 
a direct construction of a UTM. 

Let the UTM U have a 20-letter alphabet Â containing the 10 symbols in A plus another
10 symbols that are marked copies of the symbols in A. (The marked copies are used to
simulate multiple tracks on a one-track TM.) That is, we define Â as follows:

Â = {<, >, [, ], #, 0, 1, β, R, L} ∪ {marked copies of <, >, [, ], #, 0, 1, β, R, L}

To simulate the TM M on the input string w, we place M's canonical encoding, ρ(M),
on the tape of the UTM U preceded by β and followed by w, as suggested in Fig. 5.10. The
first letter of w follows the rightmost bracket, ], and is marked by replacing it with its marked
equivalent. The current state q of M is identified by replacing the left bracket, [, in q's
triple by its marked equivalent. U simulates M by reading the marked input symbol a,
the one that resides under M's simulated head, and advancing its own head to the entry to
the right of the marked [ that corresponds to a. (Before it moves its head, it replaces the marked [
with [.) That is, it advances its head to the first, second, or third entry of the triple associated
with the current state depending on whether a is 0, 1, or β. It then changes < to its marked
copy, moves to the symbol following the marked < and takes the required action on the simulated
tape. If the action requires writing a symbol, it replaces a with a new marked symbol. If it
requires moving M's head, the marking on a is removed and the appropriate adjacent symbol is
marked. U returns to the marked < and removes the mark.

Figure 5.10 The initial configuration of the tape of a universal TM that is prepared to simulate
the TM M on input w. The left end-of-tape marker is the blank symbol β.

The UTM U moves to the next state as follows. It moves its head three places to the
right of the marked < after changing it to <, at which point it is to the right of #, over the first
digit representing the next state. If the symbol in this position is >, the next state is h, the halting
state, and the UTM halts. If the symbol is 1, U replaces it with a marked 1 and then moves its head
left to the leftmost instance of [ (the leftmost tape cell contains β, an end-of-tape marker). It
marks [ and returns to the marked 1. It replaces the marked 1 with 1 and moves its head right one
place. If U finds the symbol 1, it marks it, moves left to the marked [, restores it to [ and then
moves right to the next instance of [ and marks it. It then moves right to the marked 1 and repeats
this operation. However, if the UTM finds the symbol >, it has finished updating the current state
so it moves right to the marked tape symbol, at which point it reads the symbol under M's head
and starts another transition cycle. The details of this construction are left to the reader. (See
Problem 5.15.)
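What the UTM computes can be illustrated by a direct interpreter for canonical encodings. This sketch does not reproduce the head-marking construction above; it decodes the triples into a table and steps through them. It assumes encodings written without spaces, states numbered from 1 with 0 standing for the halt state, a head that stays put when asked to move left from the leftmost cell, and a step budget `max_steps` added only so the sketch always returns. The names `decode`, `simulate`, `FIRST`, and `SECOND` are hypothetical.

```python
import re

BLANK = "β"
ENTRY = re.compile(r"<([01βLR])#(1*)>")

def decode(encoding):
    """Return table[state][symbol] = (action, next_state); 0 means halt."""
    entries = ENTRY.findall(encoding)
    table = {}
    for i in range(0, len(entries), 3):
        state = i // 3 + 1
        table[state] = {
            sym: (z, len(q)) for sym, (z, q) in zip("01" + BLANK, entries[i:i + 3])
        }
    return table

def simulate(encoding, w, max_steps=10_000):
    """Run the encoded machine on w; return (tape contents, head position)."""
    table = decode(encoding)
    tape = dict(enumerate(w))
    state, head, steps = 1, 0, 0
    while state != 0:
        if steps == max_steps:
            raise RuntimeError("step budget exhausted; the machine may not halt")
        action, state = table[state][tape.get(head, BLANK)]
        if action == "R":
            head += 1
        elif action == "L":
            head = max(head - 1, 0)
        else:
            tape[head] = action            # write a symbol in place
        steps += 1
    cells = "".join(tape.get(i, BLANK) for i in range(max(tape, default=-1) + 1))
    return cells.rstrip(BLANK), head

FIRST = "[<R#1><R#1><β#>]"
SECOND = ("[<β#11><β#111><β#>][<R#1111><R#1111><R#1111>]"
          "[<R#11111><R#11111><R#11111>][<0#11><0#111><0#>][<1#11><1#111><1#>]")
```

Running `simulate(SECOND, "011")` shifts the input one cell to the right and leaves a blank in the first cell, as the description of the second example machine says it should.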



5.6 Encodings of Strings and Turing Machines 

Given an alphabet A with an ordering of its letters, strings over this alphabet have an order
known as the standard lexicographical order, which we now define. In this order, strings of
length n − 1 precede strings of length n. Thus, if A = {0, 1, 2}, 201 < 0001. Among the
strings of length n, if a and b are in A and a < b, then all strings beginning with a precede
those beginning with b. For example, if 0 < 1 < 2 in A = {0, 1, 2}, then 022 < 200. If two
strings of length n have the same prefix u, the ordering between them is determined by the
order of the next letter. For example, for the alphabet A and the ordering given on its letters,
201021 < 201200.

A simple algorithm produces the strings over an alphabet in lexicographical order. Strings
of length 1 are produced by enumerating the letters from the alphabet in increasing order.
Strings of length n are enumerated by choosing the first letter from the alphabet in increasing
order. The remaining n − 1 letters are generated in lexicographical order by applying this
algorithm recursively on strings of length n − 1.
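The recursive enumeration can be sketched as a pair of Python generators. One detail is an assumption of the sketch: the recursion bottoms out at length 0 (the empty string) rather than length 1, which produces the same strings of each positive length. The function names are hypothetical, and the alphabet is passed as a string whose characters are already in increasing order.

```python
from itertools import count, islice

def strings_of_length(alphabet, n):
    """Yield all strings of length n, first letter varying slowest."""
    if n == 0:
        yield ""
        return
    for first in alphabet:
        for rest in strings_of_length(alphabet, n - 1):
            yield first + rest

def standard_order(alphabet):
    """Yield every non-empty string, shorter strings before longer ones."""
    for n in count(1):
        yield from strings_of_length(alphabet, n)

# The first few strings over {0, 1, 2}:
first_strings = list(islice(standard_order("012"), 13))
```

Because `standard_order` is an infinite generator, `islice` is used to take a finite prefix.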

To prepare for later results, we observe that it is straightforward to test an arbitrary string
over the alphabet A given in Definition 5.5.1 to determine if it is a canonical description ρ(M)
of a Turing machine M. Each must be contained in ([(<{0, 1, β, L, R}#1*>)^3])* and have
a transition for each state and tape letter. If a putative encoding is not canonical, we associate
with it the two-state null TM T_null with next-state function satisfying δ(s, a) = (h, a) for all
tape letters a. This encoding associates a Turing machine with each string over the alphabet A.

We now show how to identify the jth Turing machine, M_j. Given an order to the
symbols in A, strings over this alphabet are generated in lexicographical order. We define the
null TM to be the zeroth TM. Each string over A that is not a canonical encoding is associated
with this machine. The first TM is the one described by the lexicographically first string over
A that is a canonical encoding. The second TM is described by the second canonical encoding,
etc. Not only does a TM determine which string is a canonical encoding, but when combined
with an algorithm to generate strings in lexicographical order, this procedure also assigns a
Turing machine to each string and allows the jth Turing machine to be found.

Observe that there is no loss in generality in assuming that the encodings of Turing machines
are binary strings. We need only create a mapping from the letters in the alphabet A
to binary strings. Since it may be necessary to use marked letters, we can assume that the 20
letters in Â are available and are encoded into 5-bit binary strings. This allows us to view
encodings of Turing machines as binary strings but to speak of the encodings in terms of the
letters in the alphabet A.



5.7 Limits on Language Acceptance 



A language L that is decidable (also called recursive) has an algorithm, a Turing machine 
that halts on all inputs and accepts just those strings in L. A language for which there is a 
Turing machine that accepts just those strings in L, possibly not halting on strings not in L, 
is recursively enumerable. A language that is recursively enumerable but not decidable is 
unsolvable. 

We begin by describing some decidable languages and then exhibit a language, ℒ_1, that
is not recursively enumerable (no Turing machine exists to accept the strings in it) but whose
complement, ℒ_2, is recursively enumerable but not decidable; that is, ℒ_2 is unsolvable. We use
the language ℒ_2 to show that other languages, including the halting problem, are unsolvable.

5.7.1 Decidable Languages 

Our first decidable problem is the language of pairs of regular expressions and strings such that
the regular expression describes a language containing the corresponding string:

ℒ_rx = {R, w | w is in the language described by the regular expression R}

THEOREM 5.7.1 The language ℒ_rx is decidable.

Proof To decide on a string R, w, use the method of Theorem 4.4.1 to construct an NFSM
M_1 that accepts the language described by R. Then invoke the method of Theorem 4.2.1
to construct a DFSM M_2 accepting the same language as M_1. The string w is given to M_2,
which accepts it if R can generate it and rejects it otherwise. This procedure decides ℒ_rx
because it halts on all strings R, w, whether in ℒ_rx or not. ■

As a second example, we show that the language of encodings of finite-state machines that
accept the empty language is decidable. Here an FSM encoded as a Turing machine reads one
input from the tape per step and makes a state transition, halting when it reaches the blank letter.

THEOREM 5.7.2 The language L = {ρ(M) | M is a DFSM and L(M) = ∅} is decidable.

Proof L(M) is not empty if there is some string w it can accept. To determine if there
is such a string, we use a TM M' that executes a breadth-first search on the graph of the
DFSM M that is provided as input to M'. M' first marks the initial state of M and then
repeatedly marks any state that has not been marked previously and can be reached from a
marked state until no additional states can be marked. This process terminates because M
has a finite number of states. Finally, M' checks to see if any accepting state is marked,
that is, can be reached from the initial state, rejecting the input ρ(M) if so and accepting it if
not. ■
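The marking procedure of this proof is ordinary breadth-first search, sketched below on a DFSM given directly as a Python dictionary rather than as an encoding on a tape. The representation (`delta[state][letter]` is the next state) and the function name `dfsm_language_is_empty` are assumptions of the sketch.

```python
from collections import deque

def dfsm_language_is_empty(delta, start, accepting):
    """Return True when no accepting state is reachable from start."""
    marked = {start}                      # states known to be reachable
    frontier = deque([start])
    while frontier:
        state = frontier.popleft()
        for nxt in delta.get(state, {}).values():
            if nxt not in marked:         # mark each state at most once
                marked.add(nxt)
                frontier.append(nxt)
    return marked.isdisjoint(accepting)
```

Each state enters the queue at most once, so the loop ends after a number of iterations bounded by the number of states.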

The third language describes context-free grammars generating languages that are empty. 
Here we encode the definition of a context-free grammar G as a string p(G) over a small 
alphabet. 

THEOREM 5.7.3 The language L = {ρ(G) | G is a CFG and L(G) = ∅} is decidable.

Proof We design a TM M' that, when given as input a description ρ(G) of a CFG G,
first marks all the terminals of the grammar and then scans all the rules of the grammar,
marking non-terminal symbols that can be replaced by a string of marked symbols. (If there is a
non-terminal A that is not marked and there is a rule A → BCD in which B, C, D have
already been marked, then the TM also marks A.) We repeat this procedure until no new
non-terminals can be marked. This process terminates because the grammar G has a finite
number of non-terminals. If the start symbol S is not marked, we accept ρ(G). Otherwise, we
reject ρ(G) because it is possible to generate a string of terminals from S. ■
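The marking argument can be sketched as a fixed-point loop. The grammar is assumed to be given as a list of (head, body) pairs with single-character symbols, which is a convenience of this sketch and not the encoding ρ(G); the function name `cfg_language_is_empty` is hypothetical.

```python
def cfg_language_is_empty(rules, terminals, start):
    """Return True when the start symbol cannot derive any terminal string."""
    marked = set(terminals)               # terminals are marked first
    changed = True
    while changed:                        # repeat until nothing new is marked
        changed = False
        for head, body in rules:
            if head not in marked and all(sym in marked for sym in body):
                marked.add(head)
                changed = True
    return start not in marked
```

A rule with an empty body marks its head at once, since every one of its zero symbols is already marked.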

5.7.2 A Language That Is Not Recursively Enumerable 

Not unexpectedly, there are well-defined languages that are not recursively enumerable, as we 
show in this section. We also show that the complement of a decidable language is decidable. 
This allows us to exhibit a language that is recursively enumerable but undecidable. 

Consider the language ℒ_1 defined below. It contains the ith binary input string if it is not
accepted by the ith Turing machine.

ℒ_1 = {w_i | w_i is not accepted by M_i}

THEOREM 5.7.4 The language ℒ_1 is not recursively enumerable; that is, no Turing machine exists
that can accept all the strings in this language.

Figure 5.11 A table whose rows and columns are indexed by input strings and Turing machines,
respectively. Here w_i is the ith input string and ρ(M_j) is the encoding of the jth Turing
machine. The entry in row i, column j indicates whether or not M_j accepts w_i. The language
ℒ_1 consists of input strings w_i for which the entry in the ith row and ith column is reject.

Proof We use proof by contradiction; that is, we assume the existence of a TM M_k that
accepts ℒ_1. If w_k is in ℒ_1, then M_k accepts it, contradicting the definition of ℒ_1. This
implies that w_k is not in ℒ_1. On the other hand, if w_k is not in ℒ_1, then it is not accepted
by M_k. It follows from the definition of ℒ_1 that w_k is in ℒ_1. Thus, w_k is in ℒ_1 if and only
if it is not in ℒ_1. We have a contradiction and no Turing machine accepts ℒ_1. ■

This proof uses diagonalization. (See Fig. 5.11.) In effect, we construct an infinite two-
dimensional matrix whose rows are indexed by input words and whose columns are indexed
by Turing machines. The entry in row i and column j of this matrix specifies whether or not
input word w_i is accepted by M_j. The language ℒ_1 contains those words w_i that M_i rejects,
that is, it contains row indices (words) for which the word "reject" is found on the diagonal.
If we assume that some TM, M_k, accepts ℒ_1, we have a problem because we cannot decide
whether or not w_k is in ℒ_1. Diagonalization is effective in ruling out the possibility of solving
a computational problem but has limited usefulness on problems of bounded size.
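A finite fragment of the table makes the diagonal argument tangible. The sketch below is only an illustration on a made-up 3 × 3 table (the real table is infinite and its entries are not computable in general): `accepts[i][j]` records whether machine j accepts string i, and the diagonal language differs from every column at the diagonal position. All names and the table contents are hypothetical.

```python
def diagonal_language(accepts):
    """Indices i whose diagonal entry accepts[i][i] is 'reject' (False)."""
    return {i for i in range(len(accepts)) if not accepts[i][i]}

def column_language(accepts, j):
    """Indices of the strings accepted by machine j."""
    return {i for i in range(len(accepts)) if accepts[i][j]}

# A made-up fragment: rows are strings, columns are machines.
TABLE = [
    [True, False, True],
    [False, False, True],
    [True, True, True],
]
```

For every j, index j lies in exactly one of `diagonal_language(TABLE)` and `column_language(TABLE, j)`, so no column can equal the diagonal language.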

5.7.3 Recursively Enumerable but Not Decidable Languages 

We show the existence of a language that is recursively enumerable but not decidable. Our
approach is to show that the complement of a recursive language is recursive and then exhibit
a recursively enumerable language ℒ_2 whose complement ℒ_1 is not recursively enumerable:

ℒ_2 = {w_i | w_i is accepted by M_i}

THEOREM 5.7.5 The complement of a decidable language is decidable. 

Proof Let L be a recursive language accepted by a Turing machine M_1 that halts on all
input strings. Relabel the accepting halt state of M_1 as non-accepting and all non-accepting
halt states as accepting. This produces a machine M_2 that enters an accepting halt state only
when M_1 enters a non-accepting halt state and vice versa. We convert this non-standard
machine to standard form (having one accepting halt state) by adding a new accepting halt
state and making a transition to it from all accepting halt states. This new machine halts on
all inputs and accepts the complement of L. ■

THEOREM 5.7.6 The language ℒ_2 is recursively enumerable but not decidable.

Proof To establish the desired result it suffices to exhibit a Turing machine M that accepts
exactly the strings in ℒ_2, because the complement of ℒ_2 is ℒ_1, which is not recursively
enumerable, as shown above. (If ℒ_2 were decidable, then by Theorem 5.7.5 ℒ_1 would be
decidable and hence recursively enumerable.)

Given a string x in B*, let M enumerate the input strings over the alphabet B of ℒ_2
until it finds x. Let x be the ith string where i is recorded in binary on one of M's tapes.
The strings over the alphabet A used for canonical encodings of Turing machines are enumerated
and tested to determine whether or not they are canonical encodings, as described
in Section 5.6. When the encoding ρ(M_i) of the ith Turing machine is discovered, M_i is
simulated with a universal Turing machine on the input string x. This universal machine
will halt and accept the string x if it is in ℒ_2. Thus, ℒ_2 is recursively enumerable. ■



5.8 Reducibility and Unsolvability 



In this section we show that there are many languages that are unsolvable (undecidable). In the
previous section we showed that the language ℒ_2 is unsolvable. To show that a new problem
is unsolvable we use reducibility: we assume an algorithm A exists for a new language L and
then show that we can use A to obtain an algorithm for a language previously shown to be
unsolvable, thereby contradicting the assumption that algorithm A exists.

We begin by introducing reducibility and then give examples of unsolvable languages. 
Many interesting languages are unsolvable. 

5.8.1 Reducibility 

A new language ℒ_new can often be shown unsolvable by assuming it is solvable and then
showing this implies that an older language ℒ_old is solvable, where ℒ_old has been previously
shown to be unsolvable. Since this contradicts the facts, the new language cannot be solvable.
This is one application of reducibility. The formal definition of reducibility is given below
and illustrated by Fig. 5.12.

DEFINITION 5.8.1 The language L_1 is reducible to the language L_2 if there is an algorithm
computing a total function f : C* → D* that translates each string w over the alphabet C of L_1
into a string z = f(w) over the alphabet D of L_2 such that w ∈ L_1 if and only if z ∈ L_2.

In this definition, testing for membership of a string w in L_1 is reduced to testing for
membership of a string z in L_2, where the latter problem is presumably a previously solved
problem. It is important to note that the latter problem is no easier than the former, even
though the use of the word "reduce" suggests that it is. Rather, reducibility establishes a link
between two problems with the expectation that the properties of one can be used to deduce
properties of the other. For example, reducibility is used to identify NP-complete problems.
(See Sections 3.9.3 and 8.7.)

Figure 5.12 The characteristic function φ_i of L_i, i = 1, 2, has value 1 on strings in L_i and
0 otherwise. Because the language L_1 is reducible to the language L_2, there is a function f such
that for all x, φ_1(x) = φ_2(f(x)).



Reducibility is a fundamental idea that is formally introduced in Section 2.4 and used 
throughout this book. Reductions of the type defined above are known as many-to-one re- 
ductions. (See Section 8.7 for more on this subject.) 

The following lemma is a tool to show that problems are unsolvable. We use the same 
mechanism in Chapter 8 to classify languages by their use of time, space and other computa- 
tional resources. 

LEMMA 5.8.1 Let L_1 be reducible to L_2. If L_2 is decidable, then L_1 is decidable. If L_1 is
unsolvable and L_2 is recursively enumerable, L_2 is also unsolvable.

Proof Let T be a Turing machine implementing the algorithm that translates strings over 
the alphabet of L\ to strings over the alphabet of L 2 . If L 2 is decidable, there is a halting 
Turing machine M 2 that accepts it. A multi-tape Turing machine M\ that decides L\ can 
be constructed as follows: On input string w, M\ invokes T to generate the string z, which 
it then passes to M 2 . If M 2 accepts z, M\ accepts w. If M 2 rejects it, so does M\. Thus, 
Mi decides L\. 

Suppose now that L\ is unsolvable. Assuming that L 2 is decidable, from the above con- 
struction, L\ is decidable, contradicting this assumption. Thus, L 2 cannot be decidable. ■ 
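
The construction used in this proof can be sketched in Python by modelling the translating machine T and the halting machine $M_2$ as ordinary total functions. The two toy languages and the names `decider_via_reduction`, `translate`, and `decide_even_ones` are assumptions made only for illustration; they do not appear in the text.

```python
from typing import Callable


def decider_via_reduction(
    f: Callable[[str], str], decide_l2: Callable[[str], bool]
) -> Callable[[str], bool]:
    """Build a decider for L1 from a translation f and a decider for L2."""

    def decide_l1(w: str) -> bool:
        z = f(w)             # invoke T to generate the string z
        return decide_l2(z)  # accept w exactly when M2 accepts z

    return decide_l1


# Hypothetical L2: binary strings containing an even number of 1s.
def decide_even_ones(z: str) -> bool:
    return z.count("1") % 2 == 0


# Hypothetical L1: strings over {a, b} containing an even number of a's.
# The translation maps a to 1 and b to 0, so w is in L1 iff f(w) is in L2.
def translate(w: str) -> str:
    return w.replace("a", "1").replace("b", "0")


decide_even_as = decider_via_reduction(translate, decide_even_ones)
```

Because both `translate` and `decide_even_ones` halt on every input, so does `decide_even_as`, which is the first claim of the lemma in miniature.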

The power of this lemma will be apparent in the next section. 

5.8.2 Unsolvable Problems 

In this section we examine six representative unsolvable problems. They range from the classi- 
cal halting problem to Rice's theorem. 

We begin by considering the halting problem for Turing machines. The problem is to 
determine for an arbitrary TM M and an arbitrary input string x whether M with input x 
halts or not. We characterize this problem by the language $\mathcal{L}_H$ shown below. We show it is
unsolvable, that is, $\mathcal{L}_H$ is recursively enumerable but not decidable. No Turing machine exists
to decide this language.

$$\mathcal{L}_H = \{p(M), w \mid M \text{ halts on input } w\}$$






THEOREM 5.8.1 The language $\mathcal{L}_H$ is recursively enumerable but not decidable.

Proof To show that $\mathcal{L}_H$ is recursively enumerable, pass the encoding p(M) of the TM M
and the input string w to the universal Turing machine U of Section 5.5. This machine
simulates M and halts on the input w if and only if M halts on w. Thus, $\mathcal{L}_H$ is recursively
enumerable.

To show that $\mathcal{L}_H$ is undecidable, we assume that $\mathcal{L}_H$ is decidable by a Turing machine
$M_H$ and show a contradiction. Using $M_H$ we construct a Turing machine $M^*$ that decides
the language $\mathcal{L}^* = \{p(M), w \mid w \text{ is not accepted by } M\}$. $M^*$ simulates $M_H$ on p(M), w
to determine whether M halts or not on w. If $M_H$ says that M does not halt, $M^*$ accepts
w. If $M_H$ says that M does halt, $M^*$ simulates M on input string w and rejects w if M
accepts it and accepts w if M rejects it. Thus, if $\mathcal{L}_H$ is decidable, so is $\mathcal{L}^*$.

The procedures described in Section 5.6 can be used to design a Turing machine $M'$
that determines for which integer i the input string w is lexicographically the ith string, $w_i$,
and also produces the description $p(M_i)$ of the ith Turing machine $M_i$.

To decide $\mathcal{L}_1$ we use $M'$ to translate an input string $w = w_i$ to the string $p(M_i), w_i$.
Given the presumed existence of $M^*$, we can decide $\mathcal{L}_1$ by deciding $\mathcal{L}^*$. However, by
Theorem 5.7.4, $\mathcal{L}_1$ is not decidable (it is not even recursively enumerable). Thus, $\mathcal{L}^*$ is not
decidable, which implies that $\mathcal{L}_H$ is also not decidable. ■

The second unsolvable problem we consider is the empty tape acceptance problem: given
a Turing machine M, we ask if we can tell whether it accepts the empty string. We reduce the
halting problem to it. (See Fig. 5.13.)

$$\mathcal{L}_{ET} = \{p(M) \mid L(M) \text{ contains the empty string}\}$$

THEOREM 5.8.2 The language $\mathcal{L}_{ET}$ is not decidable.

Proof To show that $\mathcal{L}_{ET}$ is not decidable, we assume that it is and derive a contradiction.
The contradiction is produced by assuming the existence of a TM $M_{ET}$ that decides $\mathcal{L}_{ET}$
and then showing that this implies the existence of a TM $M_H$ that decides $\mathcal{L}_H$.

Given an encoding p(M) for an arbitrary TM M and an arbitrary input w, the TM
$M_H$ constructs a TM T(M, w) that writes w on the tape when the tape is empty and
simulates M on w, halting if M halts. Thus, T(M, w) accepts the empty tape if M halts
on w. $M_H$ decides $\mathcal{L}_H$ by constructing an encoding of T(M, w) and passing it to $M_{ET}$.
(See Fig. 5.13.) The language accepted by T(M, w) includes the empty string if and only
if M halts on w. Thus, $M_H$ decides the halting problem, which as shown earlier cannot be
decided. ■

Figure 5.13 Schematic representation of the reduction from $\mathcal{L}_H$ to $\mathcal{L}_{ET}$.
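
The machine T(M, w) built in this proof can be illustrated with a short Python sketch. Here a Turing machine is modelled, as an assumption, by a Python callable that either returns (halts) or runs forever; `make_T` and `decide_halting` are hypothetical names, and the behaviour of T on non-empty input is chosen arbitrarily because the proof does not depend on it.

```python
from typing import Callable

Machine = Callable[[str], object]  # assumption: "halts" means "returns"


def make_T(M: Machine, w: str) -> Callable[[str], bool]:
    """Return T(M, w): on the empty tape it writes w and simulates M on w."""

    def T(x: str) -> bool:
        if x != "":
            return False  # assumption: reject every non-empty input
        M(w)              # simulate M on w; if M never halts, neither does T
        return True       # reached only if M halts on w, so T accepts ""

    return T


def decide_halting(
    decide_empty_tape: Callable[[Callable[[str], bool]], bool], M: Machine, w: str
) -> bool:
    """Mirror of M_H: hand T(M, w) to a presumed decider for the empty tape
    acceptance problem. No such decider exists, so this only shows the wiring."""
    return decide_empty_tape(make_T(M, w))
```

The sketch shows only the shape of the reduction: a decider for empty tape acceptance, if one existed, would answer the halting question for M and w through `decide_halting`.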

The third unsolvable problem we consider is the empty set acceptance problem: Given a
Turing machine, we ask if we can tell if the language it accepts is empty. We reduce the halting
problem to this language.

$$\mathcal{L}_{EL} = \{p(M) \mid L(M) = \emptyset\}$$

THEOREM 5.8.3 The language $\mathcal{L}_{EL}$ is not decidable.

Proof We reduce $\mathcal{L}_H$ to $\mathcal{L}_{EL}$: we assume that $\mathcal{L}_{EL}$ is decidable by a TM $M_{EL}$ and then show
that a TM $M_H$ exists that decides $\mathcal{L}_H$, thereby establishing a contradiction.

Given an encoding p(M) for an arbitrary TM M and an arbitrary input w, the TM
$M_H$ constructs a TM T(M, w) that accepts the string placed on its tape if it is w and M
halts on it; otherwise it enters an infinite loop. $M_H$ can implement T(M, w) by entering an
infinite loop if its input string is not w and otherwise simulating M on w with a universal
Turing machine.

It follows that L(T(M, w)) is empty if M does not halt on w and contains w if it does
halt. Under the assumption that $M_{EL}$ decides $\mathcal{L}_{EL}$, $M_H$ can decide $\mathcal{L}_H$ by constructing
T(M, w) and passing it to $M_{EL}$, which accepts p(T(M, w)) if M does not halt on w and
rejects it if M does halt. Thus, $M_H$ decides $\mathcal{L}_H$, a contradiction. ■

The fourth problem we consider is the regular machine recognition problem. In this 
case we ask if a Turing machine exists that can decide from the description of an arbitrary 
Turing machine M whether the language accepted by M is regular or not: 

$$\mathcal{L}_R = \{p(M) \mid L(M) \text{ is regular}\}$$

THEOREM 5.8.4 The language $\mathcal{L}_R$ is not decidable.

Proof We assume that a TM $M_R$ exists to decide $\mathcal{L}_R$ and show that this implies the
existence of a TM $M_H$ that decides $\mathcal{L}_H$, a contradiction. Thus, $M_R$ cannot exist.

Given an encoding p(M) for an arbitrary TM M and an arbitrary input w, the TM
$M_H$ constructs a TM T(M, w) that scans its tape. If it finds a string in $\{0^n 1^n \mid n > 0\}$, it
accepts it; if not, T(M, w) erases the tape and simulates M on w, halting only if M halts
on w. Thus, T(M, w) accepts all strings in $\mathcal{B}^*$ if M halts on w but accepts only strings
in $\{0^n 1^n \mid n > 0\}$ otherwise. Thus, T(M, w) accepts the regular language $\mathcal{B}^*$ if M halts
on w and accepts the context-free language $\{0^n 1^n \mid n > 0\}$ otherwise. Thus, $M_H$ can be
implemented by constructing T(M, w) and passing it to $M_R$, which is presumed to decide
$\mathcal{L}_R$. ■

The fifth problem generalizes the above result and is known as Rice's theorem. It says that 
no algorithm exists to determine from the description of a TM whether or not the language it 
accepts falls into any proper subset of the recursively enumerable languages. 

Let RE be the set of recursively enumerable languages over B. For each set C that is a 
proper subset of RE, define the following language: 

$$\mathcal{L}_C = \{p(M) \mid L(M) \in C\}$$

Rice's theorem says that, for all C such that $C \neq \emptyset$ and $C \subset RE$, the language $\mathcal{L}_C$ defined above
is undecidable.




THEOREM 5.8.5 (Rice) Let $C \subset RE$, $C \neq \emptyset$. The language $\mathcal{L}_C$ is not decidable.

Proof To prove that $\mathcal{L}_C$ is not decidable, we assume that it is decidable by the TM $M_C$ and
show that this implies the existence of a TM $M_H$ that decides $\mathcal{L}_H$, which has been shown
previously not to exist. Thus, $M_C$ cannot exist.

We consider two cases, the first in which $\mathcal{B}^*$ is not in C and the second in which it is in
C. In the first case, let L be a language in C. In the second, let L be a language in RE − C.
Since C is a proper subset of RE and not empty, there is always a language L such that one
of L and $\mathcal{B}^*$ is in C and the other is in its complement RE − C.

Given an encoding p(M) for an arbitrary TM M and an arbitrary input w, the TM
$M_H$ constructs a (four-tape) TM T(M, w) that simulates two machines in parallel (by
alternately simulating one step of each machine). The first, $M_0$, uses a phrase-structure
grammar for L to see if T(M, w)'s input string x is in L; it holds x on one tape, holds the
current choice inputs for the NDTM $M_L$ of Theorem 5.4.2 on a second, and uses a third
tape for the deterministic simulation of $M_L$. (See the comments following Theorem 5.4.2.)
T(M, w) halts if $M_0$ generates x. The second TM writes w on the fourth tape and
simulates M on it. T(M, w) halts if M halts on w. Thus, T(M, w) accepts the regular
language $\mathcal{B}^*$ if M halts on w and accepts L otherwise. Thus, $M_H$ can be implemented by
constructing T(M, w) and passing it to $M_C$, which is presumed to decide $\mathcal{L}_C$. ■

Our last problem is the self-terminating machine problem. The question addressed is 
whether a Turing machine M given a description p(M) of itself as input will halt or not. The
problem is defined by the following language. We give a direct proof that it is undecidable;
that is, we do not reduce some other problem to it.

$$\mathcal{L}_{ST} = \{p(M) \mid M \text{ is self-terminating}\}$$

THEOREM 5.8.6 The language $\mathcal{L}_{ST}$ is recursively enumerable but not decidable.

Proof To show that $\mathcal{L}_{ST}$ is recursively enumerable we exhibit a TM T that accepts strings
in $\mathcal{L}_{ST}$. T makes a copy of its input string p(M) and simulates M on p(M) by passing
(p(M), p(M)) to a universal TM that halts and accepts p(M) if it is in $\mathcal{L}_{ST}$.

To show that $\mathcal{L}_{ST}$ is not decidable, we assume that it is and arrive at a contradiction.
Let $M_{ST}$ decide $\mathcal{L}_{ST}$. We design a TM $M^*$ that does the following: $M^*$ simulates $M_{ST}$ on
the input string w. If $M_{ST}$ halts and accepts w, $M^*$ enters an infinite loop. If $M_{ST}$ halts
and rejects w, $M^*$ accepts w. ($M_{ST}$ halts on all inputs.)

The new machine $M^*$ is either self-terminating or it is not. If $M^*$ is self-terminating,
then on input $p(M^*)$, which is an encoding of itself, $M^*$ enters an infinite loop because
$M_{ST}$ detects that it is self-terminating. Thus, $M^*$ is not self-terminating. On the other
hand, if $M^*$ is not self-terminating, on input $p(M^*)$ it halts and accepts $p(M^*)$ because
$M_{ST}$ detects that it is not self-terminating and enters the rejecting halt state. But this
contradicts the assumption that $M^*$ is not self-terminating. Since we arrive at a contradiction
in both cases, the assumption that $\mathcal{L}_{ST}$ is decidable must be false. ■

5.9 Functions Computed by Turing Machines 

In this section we introduce the partial recursive functions, a family of functions in which 
each function is constructed from three basic function types, zero, successor, and projection, 
and three operations on functions, composition, primitive recursion, and minimalization. Although
we do not have the space to show this, the functions computed by Turing machines are
exactly the partial recursive functions. In this section, we show one half of this result, namely, 
that every partial recursive function can be encoded as a RAM program (see Section 3.4.3) that 
can be executed by Turing machines. 

We begin with the primitive recursive functions and then describe the partial recursive
functions. We then show that partial recursive functions can be realized by RAM programs.

5.9.1 Primitive Recursive Functions 

Let $\mathbb{N} = \{0, 1, 2, 3, \ldots\}$ be the set of non-negative integers. The partial recursive functions,
$f : \mathbb{N}^n \mapsto \mathbb{N}^m$, map n-tuples of integers over $\mathbb{N}$ to m-tuples of integers in $\mathbb{N}$ for arbitrary
n and m. Partial recursive functions may be partial functions. They are constructed from
three base function types, the successor function $S : \mathbb{N} \mapsto \mathbb{N}$, where $S(x) = x + 1$,
the predecessor function $P : \mathbb{N} \mapsto \mathbb{N}$, where $P(x)$ returns either 0 if $x = 0$ or the
integer one less than x, and the projection functions $U_j^n : \mathbb{N}^n \mapsto \mathbb{N}$, $1 \le j \le n$, where
$U_j^n(x_1, x_2, \ldots, x_n) = x_j$. These basic functions are combined using a finite number of
applications of function composition, primitive recursion, and minimalization.

Function composition is studied in Chapters 2 and 6. A function $f : \mathbb{N}^n \mapsto \mathbb{N}$ of n
arguments is defined by the composition of a function $g : \mathbb{N}^m \mapsto \mathbb{N}$ of m arguments with
m functions $f_1 : \mathbb{N}^n \mapsto \mathbb{N}$, $f_2 : \mathbb{N}^n \mapsto \mathbb{N}$, ..., $f_m : \mathbb{N}^n \mapsto \mathbb{N}$, each of n arguments, as
follows:

$$f(x_1, x_2, \ldots, x_n) = g(f_1(x_1, x_2, \ldots, x_n), \ldots, f_m(x_1, x_2, \ldots, x_n))$$

A function $f : \mathbb{N}^{n+1} \mapsto \mathbb{N}$ of n + 1 arguments is defined by primitive recursion from a
function $g : \mathbb{N}^n \mapsto \mathbb{N}$ of n arguments and a function $h : \mathbb{N}^{n+2} \mapsto \mathbb{N}$ on n + 2 arguments
if and only if for all values of $x_1, x_2, \ldots, x_n$ and y in $\mathbb{N}$:

$$f(x_1, x_2, \ldots, x_n, 0) = g(x_1, x_2, \ldots, x_n)$$
$$f(x_1, x_2, \ldots, x_n, y + 1) = h(x_1, x_2, \ldots, x_n, y, f(x_1, x_2, \ldots, x_n, y))$$

In the above definition if n = 0, we adopt the convention that the value of f(0) is a constant.
Thus, $f(x_1, x_2, \ldots, x_n, k)$ is defined recursively in terms of h and itself with k replaced by
k − 1 unless k = 0.

DEFINITION 5.9.1 The class of primitive recursive functions is the smallest class of functions
that contains the base functions and is closed under composition and primitive recursion. 

Many functions of interest are primitive recursive. Among these is the zero function
$Z : \mathbb{N} \mapsto \mathbb{N}$, where $Z(x) = 0$. It is defined by primitive recursion by $Z(0) = 0$ and

$$Z(x + 1) = U_2^2(x, Z(x))$$

Other important primitive recursive functions are addition, subtraction, multiplication, and
division, as we now show. Let $f_{add} : \mathbb{N}^2 \mapsto \mathbb{N}$, $f_{sub} : \mathbb{N}^2 \mapsto \mathbb{N}$, $f_{mult} : \mathbb{N}^2 \mapsto \mathbb{N}$, and
$f_{div} : \mathbb{N}^2 \mapsto \mathbb{N}$ denote integer addition, subtraction, multiplication, and division.

For the integer addition function $f_{add}$ introduce the function $h_1 : \mathbb{N}^3 \mapsto \mathbb{N}$ on three
arguments, where $h_1$ is defined below in terms of the successor and projection functions:

$$h_1(x_1, x_2, x_3) = S(U_3^3(x_1, x_2, x_3))$$

Then, $h_1(x_1, x_2, x_3) = x_3 + 1$. Now define $f_{add}(x, y)$ using primitive recursion, as follows:

$$f_{add}(x, 0) = U_1^1(x)$$
$$f_{add}(x, y + 1) = h_1(x, y, f_{add}(x, y))$$

The role of $h_1$ is to carry the values of x and y from one recursive invocation to another. To
determine the value of $f_{add}(x, y)$ from this definition, if y = 0, $f_{add}(x, y) = x$. If y > 0,
$f_{add}(x, y) = h_1(x, y - 1, f_{add}(x, y - 1))$. This in turn causes other recursive invocations of
$f_{add}$. The infix notation + is used for $f_{add}$; that is, $f_{add}(x, y) = x + y$.

Because the primitive recursive functions are defined over the non-negative integers, the
subtraction function $f_{sub}(x, y)$ must return the value 0 if y is larger than x, an operation
called proper subtraction. (Its infix notation is $\dot{-}$ and we write $f_{sub}(x, y) = x \dot{-} y$.) It is
defined as follows:

$$f_{sub}(x, 0) = U_1^1(x)$$
$$f_{sub}(x, y + 1) = U_3^3(x, y, P(f_{sub}(x, y)))$$

The value of $f_{sub}(x, y)$ is x if y = 0 and is the predecessor of $f_{sub}(x, y - 1)$ otherwise.

The integer multiplication function, $f_{mult}$, is defined in terms of the function
$h_2 : \mathbb{N}^3 \mapsto \mathbb{N}$:

$$h_2(x_1, x_2, x_3) = f_{add}(U_1^3(x_1, x_2, x_3), U_3^3(x_1, x_2, x_3))$$

Using primitive recursion, we have

$$f_{mult}(x, 0) = Z(x)$$
$$f_{mult}(x, y + 1) = h_2(x, y, f_{mult}(x, y))$$

The value of $f_{mult}(x, y)$ is zero if y = 0 and otherwise is the result of adding x to itself y
times. To see this, note that the value of $h_2$ is the sum of its first and third arguments, x and
$f_{mult}(x, y)$. On each invocation of primitive recursion the value of y is decremented by 1
until the value 0 is reached. The definition of the division function is left as Problem 5.26.

Define the function $f_{sign} : \mathbb{N} \mapsto \mathbb{N}$ so that $f_{sign}(0) = 0$ and $f_{sign}(x + 1) = 1$. To
show that $f_{sign}$ is primitive recursive it suffices to invoke the projection operator formally. A
function with value 0 or 1 is called a predicate.
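
The definitions of this section translate almost directly into code. The following Python sketch is an illustration under stated assumptions, not part of the formal development: the helper names `U`, `compose`, and `prim_rec` are hypothetical, functions on $\mathbb{N}$ are modelled as Python functions on non-negative integers, and the case n = 0 is handled by letting g take no arguments.

```python
def S(x):
    """Successor function."""
    return x + 1


def P(x):
    """Predecessor function: 0 if x = 0, otherwise x - 1."""
    return x - 1 if x > 0 else 0


def U(n, j):
    """Projection U_j^n: return the j-th of n arguments (1-indexed)."""
    def proj(*xs):
        assert len(xs) == n
        return xs[j - 1]
    return proj


def compose(g, *fs):
    """Composition: x -> g(f_1(x), ..., f_m(x))."""
    return lambda *xs: g(*(f(*xs) for f in fs))


def prim_rec(g, h):
    """Primitive recursion: f(x, 0) = g(x), f(x, y + 1) = h(x, y, f(x, y))."""
    def f(*args):
        *xs, y = args
        acc = g(*xs)
        for k in range(y):        # unrolls the recursion as a for-loop
            acc = h(*xs, k, acc)
        return acc
    return f


Z = prim_rec(lambda: 0, U(2, 2))              # Z(0) = 0, Z(x+1) = U_2^2(x, Z(x))
h1 = compose(S, U(3, 3))                      # h1(x1, x2, x3) = x3 + 1
f_add = prim_rec(U(1, 1), h1)
f_sub = prim_rec(U(1, 1), compose(P, U(3, 3)))  # applies P to the previous value
h2 = compose(f_add, U(3, 1), U(3, 3))         # h2(x1, x2, x3) = x1 + x3
f_mult = prim_rec(Z, h2)
```

For instance, `f_add(3, 4)` evaluates to 7 and `f_sub(3, 5)` to 0, matching proper subtraction. The for-loop inside `prim_rec` runs exactly y times, which is why primitive recursive functions are always total.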

5.9.2 Partial Recursive Functions 

The partial recursive functions are obtained by extending the primitive recursive functions to 
include minimalization. Minimalization defines a function $f : \mathbb{N}^n \mapsto \mathbb{N}$ in terms of a
second function $g : \mathbb{N}^{n+1} \mapsto \mathbb{N}$ by letting f(x) be the smallest integer $y \in \mathbb{N}$ such that
$g(x, y) = 0$ and $g(x, z)$ is defined for all $z < y$, $z \in \mathbb{N}$. Note that if g(x, z) is not defined
for all z < y, or if no such y exists, then f(x) is not defined. Thus, minimalization can result
in partial functions.

DEFINITION 5.9.2 The set of partial recursive functions is the smallest set of functions containing
the base functions that is closed under composition, primitive recursion, and minimalization.

A partial recursive function that is defined for all points in its domain is called a recursive 
function. 
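
Minimalization can be sketched as an unbounded search. The Python fragment below is a minimal illustration; the names `minimalize`, `g_root`, and `ceil_sqrt` are hypothetical, and g is assumed to be a total Python function, so the only way the search can fail to return is that no y with g(x, y) = 0 exists.

```python
def minimalize(g):
    """Return f with f(x) = the smallest y such that g(x, y) == 0.

    If no such y exists the while-loop never terminates, which models the
    fact that f is then undefined at x (f is a partial function).
    """
    def f(*xs):
        y = 0
        while g(*xs, y) != 0:
            y += 1
        return y
    return f


def g_root(x, y):
    """Proper subtraction of y*y from x: zero exactly when y*y >= x."""
    return max(x - y * y, 0)


# f(x) is the smallest y with y*y >= x, the ceiling of the square root of x.
ceil_sqrt = minimalize(g_root)
```

Applying `minimalize` to a function that is never zero, such as `lambda x, y: 1`, yields a procedure that loops forever on every input; that call is deliberately not made here.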
5.9.3 Partial Recursive Functions are RAM-Computable

There is a nice correspondence between RAM programs and partial recursive functions. The 
straight-line programs result from applying composition to the base functions. Adding primitive
recursion corresponds to adding for-loops whereas adding minimalization corresponds to
adding while loops.

It is not difficult to see that every partial recursive function can be described by a program 
in the RAM assembly language of Section 3.4.3. For example, to compute the zero function, 
Z(x), it suffices for a RAM program to clear register Rj. To compute the successor function, 
S(x), it suffices to increment register Rj. Similarly, to compute the projection function $U_j^n$,
one need only load register Ri with the contents of register Rj. Function composition is
straightforward: one need only ensure that the functions $f_j$, $1 \le j \le m$, deposit their values
in registers that are accessed by g. Similar constructions are possible for primitive recursion
and minimalization. (See Problems 5.29, 5.30, and 5.31.) 



Problems 

THE STANDARD TURING MACHINE MODEL 

5.1 Show that the standard Turing machine model of Section 5.1 and the model of Sec- 
tion 3.7 are equivalent in that one can simulate the other. 

PROGRAMMING THE TURING MACHINE 

5.2 Describe a Turing machine that generates the binary strings in lexicographical order. 
The first few strings in this ordering are 0, 1, 00, 01, 10, 11, 000, 001, ....

5.3 Describe a Turing machine recognizing $\{x^i y^j x^k \mid i, j, k > 1 \text{ and } k = i \cdot j\}$.

5.4 Describe a Turing machine that computes the function whose value on input $a^i b^j$ is
$c^k$, where $k = i \cdot j$.

5.5 Describe a Turing machine that accepts the string (u, v) if u is a substring of v.

5.6 The element distinctness language, $L_{ed}$, consists of binary strings no two of which
are the same; that is, $L_{ed} = \{2 w_1 2 \ldots 2 w_k 2 \mid w_i \in \mathcal{B}^* \text{ and } w_i \neq w_j, \text{ for } i \neq j\}$.
Describe a Turing machine that accepts this language.

EXTENSIONS TO THE STANDARD TURING MACHINE MODEL 

5.7 Given a Turing machine with a double-ended tape, show how it can be simulated by
one with a single-ended tape. 

5.8 Show equivalence between the standard Turing machine and the one-tape double-headed
Turing machine with two heads that can move independently on its one tape.

5.9 Show that a pushdown automaton with two pushdown tapes is equivalent to a Turing 
machine. 

5.10 Figure 5.14 shows a representation of a Turing machine with a two-dimensional tape 
whose head can move one step vertically or horizontally. Give a complete definition of 
a two-dimensional TM and sketch a proof that it can be simulated by a standard TM. 
Figure 5.14 A schematic representation of a two-dimensional Turing machine.



5.11 By analogy with the construction given in Section 3.9.7, show that every deterministic 
T-step multi-tape Turing machine computation can be simulated on a two-tape Turing 
machine in $O(T \log T)$ steps.

PHRASE-STRUCTURE LANGUAGES AND TURING MACHINES 

5.12 Give a detailed design of a Turing machine recognizing $\{a^n b^n c^n \mid n > 1\}$.

5.13 Use the method of Theorem 5.4.1 to construct a phrase-structure grammar generating
$\{a^n b^n c^n \mid n > 1\}$.

5.14 Design a Turing machine recognizing the language {0 | i > 1}. 

UNIVERSAL TURING MACHINES 

5.15 Using the description of Section 5.5, give a complete description of a universal Turing 
machine. 

5.16 Construct a universal TM that has only two non-accepting states. 

DECIDABLE PROBLEMS 

5.17 Show that the following languages are decidable: 

a) L = {p(M), w | M is a DFSM that accepts the input string w} 

b) L = {p(M) | M is a DFSM and L(M) is infinite}

5.18 The symmetric difference between sets A and B is defined by $(A - B) \cup (B - A)$,
where $A - B = A \cap \overline{B}$. Use the symmetric difference to show that the following
language is decidable:

$$\mathcal{L}_{\mathrm{EQ\_FSM}} = \{p(M_1), p(M_2) \mid M_1 \text{ and } M_2 \text{ are FSMs recognizing the same language}\}$$
5.19 Show that the following language is decidable:

L = {p(G), w | p(G) encodes a CFG G that generates w}

Hint: How long is a derivation of w if G is in Chomsky normal form?

5.20 Show that the following language is decidable:

L = {p(G) | p(G) encodes a CFG G for which $L(G) \neq \emptyset$}

5.21 Let $L_1, L_2 \in P$ where P is the class of polynomial-time problems (see Definition 3.7.2).
Show that the following statements hold:

a) $L_1 \cup L_2 \in P$

b) $L_1 L_2 \in P$, where $L_1 L_2$ is the concatenation of $L_1$ and $L_2$

c) $\overline{L_1} \in P$

5.22 Let $L_1 \in P$. Show that $L_1^* \in P$.

Hint: Try using dynamic programming, the algorithmic concept illustrated by the
parsing algorithm of Theorem 4.11.2.

UNSOLVABLE PROBLEMS 

5.23 Show that the problem of determining whether an arbitrary TM starting with a blank 
tape will ever halt is unsolvable. 

5.24 Show that the following language is undecidable: 

$$L_{EQ} = \{p(M_1), p(M_2) \mid L(M_1) = L(M_2)\}$$

5.25 Determine which of the following problems are solvable and unsolvable. Defend your 
conclusions. 

a) {p(M), w,p | M reaches state p on input w from its initial state} 

b) {p(M), p | there is a configuration $[u_1 \ldots u_m q v_1 \ldots v_n]$ yielding a configuration
containing state p}

c) {p(M), a | M writes character a when started on the empty tape}

d) {p(M) | M writes a non-blank character when started on the empty tape}

e) {p(M), w | on input w M moves its head to the left}

FUNCTIONS COMPUTED BY TURING MACHINES 

5.26 Define the integer division function $f_{div} : \mathbb{N}^2 \mapsto \mathbb{N}$ using primitive recursion.

5.27 Show that the function $f_{remain} : \mathbb{N}^2 \mapsto \mathbb{N}$ that provides the remainder of x after
division by y is a primitive recursive function.

5.28 Show that the factorial function x! is primitive recursive.

5.29 Write a RAM program (see Section 3.4.3) to realize the composition operation. 

5.30 Write a RAM program (see Section 3.4.3) to realize the primitive recursion operation. 

5.31 Write a RAM program (see Section 3.4.3) to realize the minimalization operation. 
Chapter Notes

Alan Turing introduced the Turing machine, gave an example of a universal machine and 
demonstrated the unsolvability of the halting problem in [338]. A similar model was independently
developed by Post [255]. Chomsky [69] demonstrated the equivalence of phrase-structure
languages. Rice's theorem is presented in [280].

Church gave a formal model of computation in [72]. The equivalence between the partial 
recursive functions and the Turing computable functions was shown by Kleene [168]. 

For a more extensive introduction to Turing machines, see the books by Hopcroft and 
Ullman [141] and Lewis and Papadimitriou [200]. 
Algebraic and Combinatorial Circuits



Algebraic circuits combine operations drawn from an algebraic system. In this chapter we de- 
velop algebraic and combinatorial circuits for a variety of generally non-Boolean problems, in- 
cluding multiplication and inversion of matrices, convolution, the discrete Fourier transform, 
and sorting networks. These problems are used primarily to illustrate concepts developed in 
later chapters, so that this chapter may be used for reference when studying those chapters. 

For each of the problems examined here the natural algorithms are straight-line and the 
graphs are directed and acyclic; that is, they are circuits. Not only are straight-line algorithms 
the ones typically used for these problems, but in some cases they are the best possible. 

The quality of the circuits developed here is measured by circuit size, the number of circuit 
operations, and circuit depth, the length of the longest path between input and output ver- 
tices. Circuit size is a measure of the work necessary to execute the corresponding straight-line 
program. Circuit depth is a measure of the minimal time needed for a problem on a parallel 
machine. 

For some problems, such as matrix inversion, we give serial (large-depth) as well as par- 
allel (small-depth) circuits. The parallel circuits generally require considerably more circuit 
elements than the corresponding serial circuits. 
6.1 Straight-Line Programs



Straight-line programs (SLP) are defined in Section 2.2. Each SLP step is an input, compu- 
tation, or output step. The notation (s READ x) indicates that the sth step is an input step on 
which the value x is read. The notation (s OUTPUT i) indicates that the result of the ith step 
is to be provided as output. Finally, the notation (s OP i ... k) indicates that the sth step
computes the value of the operator OP on the results generated at steps i, ..., k. We require
that s > i, ..., k so that the result produced at step s depends only on the results produced
at earlier steps. In this chapter we consider SLPs in which the inputs and operators have values
over a set A that is generally not binary. Thus, the circuits considered here are generally not
logic circuits. The basis $\Omega$ for an SLP is the set of operators it uses. A circuit is the graph of a
straight-line program. By its nature this graph is directed and acyclic. 

An example of a straight-line program that computes the fast Fourier transform (FFT) 
on four inputs is given below. (The FFT is introduced in Section 6.7.3.) Here the function
$f_{+,\alpha}(a, b) = a + b\alpha$ where $\alpha$ is a power of a constant $\omega$ that is a principal nth root of unity of
a commutative ring $\mathcal{R}$. (See Section 6.7.1.) The arguments a and b are variables with values
in $\mathcal{R}$.
The graph of the above SLP is the familiar FFT butterfly graph shown in Fig. 6.1. Assignment
statements are associated with vertices of in-degree zero and operator statements are
associated with other vertices. We attach the name of the operator or variable associated with 
each step to the corresponding vertex in the graph. We often suppress the unique indices of 
vertices, although they are retained in Fig. 6.1. 




Figure 6.1 The FFT butterfly graph on four inputs.
The function $g_s$ is associated with the sth step. The identity function with value v is
associated with the assignment statement (r READ v). Associated with the computation step
(s OP i ... k) is the function $g_s = \mathrm{OP}(g_i, \ldots, g_k)$, where $g_i, \ldots, g_k$ are the functions
computed at the steps on which the sth step depends. If a straight-line program has n inputs
and m outputs, it computes a function $f : A^n \mapsto A^m$. If $s_1, s_2, \ldots, s_m$ are the output steps,
then $f = (g_{s_1}, g_{s_2}, \ldots, g_{s_m})$. The function computed by a circuit is the function computed
by the corresponding straight-line program.

In the example above, $g_{11} = f_{+,\omega^2}(g_5, g_7) = g_5 + g_7\omega^2$, where $g_5 = f_{+,\omega^0}(g_1, g_2) =
a_0 + a_2\omega^0 = a_0 + a_2$ and $g_7 = f_{+,\omega^0}(g_3, g_4) = a_1 + a_3\omega^0 = a_1 + a_3$. Thus,

$$g_{11} = a_0 + a_1\omega^2 + a_2 + a_3\omega^2$$

which is the value of the polynomial p(x) at $x = \omega^2$ when $\omega^4 = 1$:

$$p(x) = a_0 + a_1 x + a_2 x^2 + a_3 x^3$$
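
The identity between $g_5 + g_7\omega^2$ and $p(\omega^2)$ can be checked numerically. The sketch below assumes, purely for illustration, that the ring is the integers modulo 5 with $\omega = 2$, which satisfies $\omega^4 = 1$ there; the names `P_MOD`, `OMEGA`, `f_plus`, `p`, and `g11` are hypothetical.

```python
P_MOD = 5   # assumption: work in the ring of integers modulo 5
OMEGA = 2   # 2**4 = 16, which is 1 modulo 5


def f_plus(alpha):
    """Return the operator f_{+,alpha}(a, b) = a + b * alpha over the ring."""
    return lambda a, b: (a + b * alpha) % P_MOD


def p(coeffs, x):
    """Evaluate a_0 + a_1 x + a_2 x^2 + a_3 x^3 modulo P_MOD."""
    return sum(c * pow(x, k, P_MOD) for k, c in enumerate(coeffs)) % P_MOD


def g11(a0, a1, a2, a3):
    """Follow the two butterfly levels that lead to the step-11 value."""
    g5 = f_plus(pow(OMEGA, 0, P_MOD))(a0, a2)     # a0 + a2 * omega^0
    g7 = f_plus(pow(OMEGA, 0, P_MOD))(a1, a3)     # a1 + a3 * omega^0
    return f_plus(pow(OMEGA, 2, P_MOD))(g5, g7)   # g5 + g7 * omega^2
```

With these definitions `g11(a0, a1, a2, a3)` agrees with `p([a0, a1, a2, a3], pow(OMEGA, 2, P_MOD))` for every choice of coefficients modulo 5.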

The size of a circuit is the number of operator statements it contains. Its depth is the 
length of (number of edges on) the longest path from an input to an output vertex. The basis
$\Omega$ is the set of operators used in the circuit. The size and depth of the smallest and shallowest
circuits for a function f over the basis $\Omega$ are denoted $C_\Omega(f)$ and $D_\Omega(f)$, respectively. In this
chapter we derive upper bounds on the size and depth of circuits.
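
To make the notions of value, size, and depth concrete, here is a small Python sketch of an SLP interpreter. The tuple encoding of steps, the function name `run_slp`, and the sample program are assumptions chosen for illustration; the sketch measures the size and depth of the given program, not the minimal values $C_\Omega(f)$ and $D_\Omega(f)$.

```python
import operator


def run_slp(steps, inputs, ops):
    """Evaluate a straight-line program and report its size and depth.

    Each step is a tuple: (s, "READ", name), (s, "OUTPUT", i), or
    (s, OP, i, ..., k), where OP is a key of the dictionary ops.
    """
    value, depth, outputs = {}, {}, []
    size = 0
    for s, kind, *args in steps:
        if kind == "READ":
            value[s] = inputs[args[0]]
            depth[s] = 0                      # input vertices have depth 0
        elif kind == "OUTPUT":
            outputs.append(args[0])
        else:
            assert all(i < s for i in args)   # a step uses only earlier steps
            value[s] = ops[kind](*(value[i] for i in args))
            depth[s] = 1 + max(depth[i] for i in args)
            size += 1                         # count operator steps only
    return [value[i] for i in outputs], size, max(depth[i] for i in outputs)


OPS = {"+": operator.add, "*": operator.mul}

# Hypothetical program computing (a + b) * (a * b).
PROGRAM = [
    (1, "READ", "a"),
    (2, "READ", "b"),
    (3, "+", 1, 2),
    (4, "*", 1, 2),
    (5, "*", 3, 4),
    (6, "OUTPUT", 5),
]
```

For example, `run_slp(PROGRAM, {"a": 2, "b": 3}, OPS)` returns the output list `[30]` together with size 3 and depth 2.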

6.2 Mathematical Preliminaries 

In this section we introduce rings, fields and matrices, concepts widely used in this chapter. 

6.2.1 Rings and Fields 

Rings and fields are algebraic systems that consist of a set with two special elements, 0 and 1,
and two operations called addition and multiplication that obey a small set of rules.

DEFINITION 6.2.1 A ring $\mathcal{R}$ is a five-tuple $(R, +, *, 0, 1)$, where R is closed under addition
+ and multiplication * (that is, $+ : R^2 \mapsto R$ and $* : R^2 \mapsto R$) and + and * are associative
(for all $a, b, c \in R$, $a + (b + c) = (a + b) + c$ and $a * (b * c) = (a * b) * c$). Also, $0, 1 \in R$,
where 0 is the identity under addition (for all $a \in R$, $a + 0 = 0 + a = a$) and 1 is the identity
under multiplication (for all $a \in R$, $a * 1 = 1 * a = a$). In addition, 0 is an annihilator
under multiplication (for all $a \in R$, $a * 0 = 0 * a = 0$). Every element of R has an additive
inverse (for all $a \in R$, there exists an element $-a$ such that $(-a) + a = a + (-a) = 0$). Finally,
addition is commutative (for all $a, b \in R$, $a + b = b + a$) and multiplication distributes over
addition (for all $a, b, c \in R$, $a * (b + c) = (a * b) + (a * c)$ and $(b + c) * a = (b * a) + (c * a)$).
A ring is commutative if multiplication is commutative (for all $a, b \in R$, $a * b = b * a$). A field
is a commutative ring in which each element other than 0 has a multiplicative inverse (for all
$a \in R$, $a \neq 0$, there exists an element $a^{-1}$ such that $a * a^{-1} = 1$).

Let $\mathbb{Z}$ be the set of all integers (positive, negative, and zero) and let + and * denote integer
addition and multiplication. Then $(\mathbb{Z}, +, *, 0, 1)$ is a commutative ring. (See Problem 6.1.)
Similarly, the system ({0, 1}, +, *, 0, 1), where + is addition modulo 2 (for all $a, b \in \{0, 1\}$,
a + b is the remainder of the integer sum after division by 2, that is, the EXCLUSIVE OR
operation) and * is the AND operation, is a commutative ring, as the reader can show. A third
commutative ring is the integers modulo p together with the operations of addition and
multiplication modulo p. (See Problem 6.2.) The ring of matrices introduced in the next section
is not commutative. Some important commutative rings are introduced in Section 6.7.1.
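
For a finite system the ring axioms of Definition 6.2.1 can be verified by exhaustive search. The following Python sketch is one way to do so; the function names `is_commutative_ring`, `is_field`, and `integers_mod` are hypothetical, and the approach is brute force, so it is practical only for small sets.

```python
from itertools import product


def is_commutative_ring(elems, add, mul, zero=0, one=1):
    """Check the commutative-ring axioms exhaustively on a finite set."""
    E = list(elems)
    pairs = list(product(E, repeat=2))
    triples = list(product(E, repeat=3))
    return all([
        all(add(a, b) in E and mul(a, b) in E for a, b in pairs),          # closure
        all(add(a, add(b, c)) == add(add(a, b), c) for a, b, c in triples),
        all(mul(a, mul(b, c)) == mul(mul(a, b), c) for a, b, c in triples),
        all(add(a, zero) == a == add(zero, a) for a in E),                 # 0 identity
        all(mul(a, one) == a == mul(one, a) for a in E),                   # 1 identity
        all(mul(a, zero) == zero == mul(zero, a) for a in E),              # annihilator
        all(any(add(a, b) == zero for b in E) for a in E),                 # inverses
        all(add(a, b) == add(b, a) for a, b in pairs),
        all(mul(a, add(b, c)) == add(mul(a, b), mul(a, c)) for a, b, c in triples),
        all(mul(add(b, c), a) == add(mul(b, a), mul(c, a)) for a, b, c in triples),
        all(mul(a, b) == mul(b, a) for a, b in pairs),
    ])


def is_field(elems, add, mul, zero=0, one=1):
    """A field is a commutative ring whose non-zero elements are invertible."""
    E = list(elems)
    return is_commutative_ring(E, add, mul, zero, one) and all(
        any(mul(a, b) == one for b in E) for a in E if a != zero
    )


def integers_mod(n):
    """Elements and operations of the integers modulo n."""
    return range(n), (lambda a, b: (a + b) % n), (lambda a, b: (a * b) % n)
```

Called on `integers_mod(5)`, `is_field` returns `True`; on `integers_mod(4)` it returns `False` because 2 has no multiplicative inverse modulo 4, although the system is still a commutative ring.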

6.2.2 Matrices 

A matrix over a set R is a rectangular array of elements drawn from R consisting of some
number m of rows and some number n of columns. Rows are indexed by integers from the set
{1, 2, 3, ..., m} and columns are indexed by integers from the set {1, 2, 3, ..., n}. The entry
in the ith row and jth column of A is denoted $a_{i,j}$, as suggested in the following example:

$$A = [a_{i,j}] = \begin{bmatrix} a_{1,1} & a_{1,2} & a_{1,3} & a_{1,4} \\ a_{2,1} & a_{2,2} & a_{2,3} & a_{2,4} \\ a_{3,1} & a_{3,2} & a_{3,3} & a_{3,4} \end{bmatrix} = \begin{bmatrix} 1 & 2 & 3 & 4 \\ 5 & 6 & 7 & 8 \\ 9 & 10 & 11 & 12 \end{bmatrix}$$

Thus, $a_{2,3} = 7$ and $a_{3,1} = 9$.

The transpose of a matrix A, denoted $A^T$, is the matrix obtained from A by exchanging
rows and columns, as shown below for the matrix A above:

$$A^T = \begin{bmatrix} 1 & 5 & 9 \\ 2 & 6 & 10 \\ 3 & 7 & 11 \\ 4 & 8 & 12 \end{bmatrix}$$

Clearly, the transpose of the transpose of a matrix A, $(A^T)^T$, is the matrix A.

A column n-vector x is a matrix containing one column and n rows, for example:

$$x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}$$

A row m-vector y is a matrix containing one row and m columns, for example:

$$y = [y_1, y_2, \ldots, y_m] = [1, 5, \ldots, 9]$$

The transpose of a row vector is a column vector and vice versa.

A square matrix is an $n \times n$ matrix for some integer n. The main diagonal of an $n \times n$
square matrix A is the set of elements $\{a_{1,1}, a_{2,2}, \ldots, a_{n-1,n-1}, a_{n,n}\}$. The diagonal below
(above) the main diagonal is the elements $\{a_{2,1}, a_{3,2}, \ldots, a_{n,n-1}\}$ ($\{a_{1,2}, a_{2,3}, \ldots, a_{n-1,n}\}$).
The $n \times n$ identity matrix, $I_n$, is a square $n \times n$ matrix with value 1 on the main diagonal
and 0 elsewhere. The $n \times n$ zero matrix, $0_n$, has value 0 in each position. A matrix is upper
(lower) triangular if all elements below (above) the main diagonal are 0. A square matrix A is
symmetric if $A = A^T$, that is, $a_{i,j} = a_{j,i}$ for all $1 \le i, j \le n$.

The scalar product of a scalar $c \in R$ and an $n \times m$ matrix A over R, denoted cA, has
value $c\,a_{i,j}$ in row i and column j.



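The matrix-vector product between an $m \times n$ matrix A and a column n-vector x is the
column m-vector b below:

$$b = Ax = \begin{bmatrix} a_{1,1} & a_{1,2} & \cdots & a_{1,n} \\ a_{2,1} & a_{2,2} & \cdots & a_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m-1,1} & a_{m-1,2} & \cdots & a_{m-1,n} \\ a_{m,1} & a_{m,2} & \cdots & a_{m,n} \end{bmatrix} \times \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_{n-1} \\ x_n \end{bmatrix} = \begin{bmatrix} a_{1,1} * x_1 + a_{1,2} * x_2 + \cdots + a_{1,n} * x_n \\ a_{2,1} * x_1 + a_{2,2} * x_2 + \cdots + a_{2,n} * x_n \\ \vdots \\ a_{m-1,1} * x_1 + a_{m-1,2} * x_2 + \cdots + a_{m-1,n} * x_n \\ a_{m,1} * x_1 + a_{m,2} * x_2 + \cdots + a_{m,n} * x_n \end{bmatrix}$$

Thus, $b_j$ is defined as follows for $1 \le j \le m$:

$$b_j = a_{j,1} * x_1 + a_{j,2} * x_2 + \cdots + a_{j,n} * x_n$$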

The matrix-vector product between a row m-vector x and an $m \times n$ matrix A is the row
n-vector b below:

$$b = [b_i] = xA$$

where for $1 \le i \le n$, $b_i$ satisfies

$$b_i = x_1 * a_{1,i} + x_2 * a_{2,i} + \cdots + x_m * a_{m,i}$$

The special case of a matrix-vector product between a row n-vector, x, and a column n-vector,
y, denoted $x \cdot y$ and defined below, is called the inner product of the two vectors:

$$x \cdot y = \sum_{i=1}^{n} x_i * y_i$$
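The three products defined above can be sketched in a few lines of Python. This is only an illustration: the helper names `mat_vec`, `vec_mat`, and `inner` are ours, not the text's, and matrices are assumed to be stored as lists of rows over the integers.

```python
def mat_vec(A, x):
    """Return b = A x for an m x n matrix A (list of rows) and a column n-vector x."""
    n = len(x)
    assert all(len(row) == n for row in A), "A must have n columns"
    # b_j = a_{j,1} * x_1 + ... + a_{j,n} * x_n, one entry per row of A
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def vec_mat(x, A):
    """Return b = x A for a row m-vector x and an m x n matrix A."""
    m = len(A)
    assert len(x) == m, "x must have one entry per row of A"
    n = len(A[0])
    # b_i = x_1 * a_{1,i} + ... + x_m * a_{m,i}, one entry per column of A
    return [sum(x[k] * A[k][i] for k in range(m)) for i in range(n)]


def inner(x, y):
    """Inner product of a row n-vector x and a column n-vector y."""
    assert len(x) == len(y)
    return sum(xi * yi for xi, yi in zip(x, y))


A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
print(mat_vec(A, [1, 0, 0, 1]))     # [5, 13, 21]
print(vec_mat([1, 1, 1], A))        # [15, 18, 21, 24]
print(inner([1, 2, 3], [4, 5, 6]))  # 32
```

Note that `mat_vec` returns one entry per row of A, while `vec_mat` returns one entry per column.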



If the entries of the $n \times n$ matrix A and the column n-vectors x and b shown below are
drawn from a ring $\mathcal{R}$ and A and b are given, then the following matrix equation defines a
linear system of n equations in the n unknowns x:

$$Ax = b$$

Solving a linear system, when it is possible, consists of finding values for x given values for
A and b. (See Section 6.6.)

Consider the set of $m \times n$ matrices whose entries are drawn from a ring $\mathcal{R}$. The matrix
addition function $f^{(m,n)}_{A+B} : \mathcal{R}^{2mn} \mapsto \mathcal{R}^{mn}$ on two $m \times n$ matrices $A = [a_{i,j}]$ and $B = [b_{i,j}]$
generates a matrix $C = f^{(m,n)}_{A+B}(A, B) = A +_{m,n} B = [c_{i,j}]$, where $+_{m,n}$ is the infix matrix
addition operator and $c_{i,j}$ is defined as

$$c_{i,j} = a_{i,j} + b_{i,j}$$

The straight-line program based on this equation uses one instance of the ring addition
operator + for each entry in C. It follows that over the basis $\{+\}$, $C_+(f^{(m,n)}_{A+B}) = mn$ and
$D_+(f^{(m,n)}_{A+B}) = 1$. Two special cases of matrix addition are the addition of square matrices
(m = n), denoted $+_n$, and the addition of row or column vectors that are either $1 \times n$ or
$m \times 1$ matrices.



The matrix multiplication function $f^{(n)}_{A \times B} : \mathcal{R}^{(m+p)n} \mapsto \mathcal{R}^{mp}$ multiplies an $m \times n$
matrix $A = [a_{i,j}]$ by an $n \times p$ matrix $B = [b_{i,j}]$ to produce the $m \times p$ matrix
$C = f^{(n)}_{A \times B}(A, B) = A \times_n B = [c_{i,j}]$, where

$$c_{i,j} = \sum_{k=1}^{n} a_{i,k} * b_{k,j} \qquad (6.1)$$

and $\times_n$ is the infix matrix multiplication operator. The subscript on $\times_n$ is usually dropped
when the dimensions of the matrices are understood. The standard matrix multiplication
algorithm for multiplying an $m \times n$ matrix A by an $n \times p$ matrix B forms mp inner products
of the kind shown in equation (6.1). Thus, it uses mnp instances of the ring multiplication
operator and $m(n-1)p$ instances of the ring addition operator.
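The operation counts just stated can be checked with a small sketch of the standard algorithm that tallies ring operations as it goes. The function name `mat_mul_counted` and the list-of-rows representation are assumptions made for this illustration.

```python
def mat_mul_counted(A, B):
    """Multiply an m x n matrix A by an n x p matrix B using equation (6.1).

    Returns (C, mults, adds), where mults and adds count ring operations.
    """
    m, n, p = len(A), len(B), len(B[0])
    assert all(len(row) == n for row in A), "inner dimensions must agree"
    mults = adds = 0
    C = [[0] * p for _ in range(m)]
    for i in range(m):
        for j in range(p):
            # The first term of the inner product needs no addition.
            acc = A[i][0] * B[0][j]
            mults += 1
            for k in range(1, n):
                acc = acc + A[i][k] * B[k][j]
                mults += 1
                adds += 1
            C[i][j] = acc
    return C, mults, adds


A = [[1, 2, 3], [4, 5, 6]]                      # 2 x 3
B = [[1, 0, 0, 1], [0, 1, 0, 1], [0, 0, 1, 1]]  # 3 x 4
C, mults, adds = mat_mul_counted(A, B)
print(C)            # [[1, 2, 3, 6], [4, 5, 6, 15]]
print(mults, adds)  # 24 16, i.e. mnp and m(n-1)p with m=2, n=3, p=4
```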

A fast algorithm for matrix multiplication is given in Section 6.3.1. It is now straightfor- 
ward to show the following result. (See Problem 6.4.) 

THEOREM 6.2.1 Let $M_{n \times n}$ be the set of $n \times n$ matrices over a commutative ring $\mathcal{R}$. The
system $\mathcal{M}_{n \times n} = (M_{n \times n}, +_n, \times_n, 0_n, I_n)$, where $+_n$ and $\times_n$ are the matrix addition and
multiplication operators and $0_n$ and $I_n$ are the $n \times n$ zero and identity matrices, is a ring.

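The ring of matrices $\mathcal{M}_{n \times n}$ is not a commutative ring because matrix multiplication is not
commutative. For example, the following two matrices do not commute, that is, $AB \neq BA$:

$$A = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \qquad B = \begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix}$$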

A linear combination of a subset of the rows of an $n \times m$ matrix A is a sum of scalar
products of the rows in this subset. A linear combination is non-zero if the sum of the scalar
products is not the zero vector. A set of rows of a matrix A over a field $\mathcal{R}$ is linearly independent
if all linear combinations are non-zero except when each scalar is zero.

The rank of an $n \times m$ matrix A over a field $\mathcal{R}$, $f^{(n)}_{\rm rank} : \mathcal{R}^{n^2} \mapsto \mathbb{N}$, is the maximum
number of linearly independent rows of A. It is also the maximum number of linearly independent
columns of A. (See Problem 6.5.) We write ${\rm rank}(A) = f^{(n)}_{\rm rank}(A)$. An $n \times n$ matrix
A is non-singular if ${\rm rank}(A) = n$.

If an $n \times n$ matrix A over a field $\mathcal{R}$ is non-singular, it has an inverse $A^{-1}$ that is an $n \times n$
matrix with the following properties:

$$AA^{-1} = A^{-1}A = I_n$$

where $I_n$ is the $n \times n$ identity matrix. That is, there is a (partial) inverse function $f^{(n)}_{A^{-1}} :
\mathcal{R}^{n^2} \mapsto \mathcal{R}^{n^2}$ that is defined for non-singular square matrices A such that $f^{(n)}_{A^{-1}}(A) = A^{-1}$.
$f^{(n)}_{A^{-1}}$ is partial because it is not defined for singular matrices. Below we exhibit a matrix and its
inverse over a field $\mathcal{R}$.

$$\begin{bmatrix} 1 & 1 \\ -1 & 1 \end{bmatrix}^{-1} = \frac{1}{2} \begin{bmatrix} 1 & -1 \\ 1 & 1 \end{bmatrix}$$

Algorithms for matrix inversion are given in Section 6.5. 

We now show that the inverse $(AB)^{-1}$ of the product AB of two invertible matrices, A
and B, over a field $\mathcal{R}$ is the product of their inverses in reverse order.

LEMMA 6.2.1 Let A and B be invertible square matrices over a field $\mathcal{R}$. Then the following
relationship holds:

$$(AB)^{-1} = B^{-1}A^{-1}$$

Proof To show that $(AB)^{-1} = B^{-1}A^{-1}$, we multiply AB either on the left or right by
$B^{-1}A^{-1}$ to produce the identity matrix:

$$AB(AB)^{-1} = ABB^{-1}A^{-1} = A(BB^{-1})A^{-1} = AA^{-1} = I$$
$$(AB)^{-1}AB = B^{-1}A^{-1}AB = B^{-1}(A^{-1}A)B = B^{-1}B = I \qquad ■$$

The transpose of the product of an $m \times n$ matrix A and an $n \times p$ matrix B over a ring $\mathcal{R}$
is the product of their transposes in reverse order:

$$(AB)^T = B^T A^T$$

(See Problem 6.6.) In particular, the following identity holds for an $m \times n$ matrix A and a
column n-vector x:

$$x^T A^T = (Ax)^T$$

A block matrix is a matrix in which each entry is a matrix with fixed dimensions. For
example, when n is even it may be convenient to view an $n \times n$ matrix as a $2 \times 2$ matrix whose
four entries are $(n/2) \times (n/2)$ matrices.

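Two special types of matrix that are frequently encountered are the Toeplitz and circulant
matrices. An $n \times n$ Toeplitz matrix T has the property that its (i, j) entry $t_{i,j} = a_r$ for
$j = i - n + 1 + r$ and $0 \le r \le 2n - 2$. A generic Toeplitz matrix T is shown below:

$$T = \begin{bmatrix} a_{n-1} & a_n & a_{n+1} & \cdots & a_{2n-2} \\ a_{n-2} & a_{n-1} & a_n & \cdots & a_{2n-3} \\ a_{n-3} & a_{n-2} & a_{n-1} & \cdots & a_{2n-4} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ a_0 & a_1 & a_2 & \cdots & a_{n-1} \end{bmatrix}$$

An $n \times n$ circulant matrix C has the property that the entries on the kth row are a right
cyclic shift by $k - 1$ places of the entries on the first row, as suggested below.

$$C = \begin{bmatrix} a_0 & a_1 & a_2 & \cdots & a_{n-1} \\ a_{n-1} & a_0 & a_1 & \cdots & a_{n-2} \\ a_{n-2} & a_{n-1} & a_0 & \cdots & a_{n-3} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ a_1 & a_2 & a_3 & \cdots & a_0 \end{bmatrix}$$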

The circulant is a type of Toeplitz matrix. Thus the function defined by the product of a 
Toeplitz matrix and a vector contains as a subfunction the function defined by the product of 
a circulant matrix and a vector. Consequently, any algorithm to multiply a vector by a Toeplitz 
matrix can be used to multiply a circulant by a vector. 

As stated in Section 2.11, a permutation $\pi : \mathcal{R}^n \mapsto \mathcal{R}^n$ of an n-tuple $x = (x_1, x_2, \ldots,
x_n)$ over the set $\mathcal{R}$ is a rearrangement $\pi(x) = (x_{\pi(1)}, x_{\pi(2)}, \ldots, x_{\pi(n)})$ of the components
of x. An $n \times n$ permutation matrix P has entries from the set {0, 1} (here 0 and 1 are the
identities under addition and multiplication for a ring $\mathcal{R}$) with the property that each row
and column of P has exactly one instance of 1. (See the example below.) Let A be an $n \times n$
matrix. Then AP contains the columns of A in a permuted order determined by P. A similar
statement applies to PA. Shown below is a permutation matrix P and the result of multiplying
it on the left by a matrix A. In this case P interchanges the first two columns of A:

$$\begin{bmatrix} 1 & 2 & 3 & 4 \\ 5 & 6 & 7 & 8 \\ 9 & 10 & 11 & 12 \\ 13 & 14 & 15 & 16 \end{bmatrix} \times \begin{bmatrix} 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} = \begin{bmatrix} 2 & 1 & 3 & 4 \\ 6 & 5 & 7 & 8 \\ 10 & 9 & 11 & 12 \\ 14 & 13 & 15 & 16 \end{bmatrix}$$
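A short sketch makes the difference between AP and PA concrete. It assumes the permutation matrix P that exchanges the first two coordinates, as in the example, and uses a helper `mat_mul` that is our own name for the standard product.

```python
def mat_mul(A, B):
    """Standard matrix product of two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
# Exactly one 1 in every row and every column.
P = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]

AP = mat_mul(A, P)  # multiplying on the right permutes the columns of A
PA = mat_mul(P, A)  # multiplying on the left permutes the rows of A

print(AP)  # [[2, 1, 3, 4], [6, 5, 7, 8], [10, 9, 11, 12], [14, 13, 15, 16]]
print(PA)  # [[5, 6, 7, 8], [1, 2, 3, 4], [9, 10, 11, 12], [13, 14, 15, 16]]
```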



6.3 Matrix Multiplication 



Matrix multiplication is defined in Section 6.2. The standard matrix multiplication algorithm
computes the product of an $m \times n$ matrix and an $n \times p$ matrix using the formula for $c_{i,j}$ given
in (6.1). It performs mnp multiplications and $m(n-1)p$ additions. As shown in Section 6.3.1,
however, matrices can be multiplied with many fewer operations.

Boolean matrix multiplication is matrix multiplication for matrices over $\mathcal{B}$ when + denotes
OR and * denotes AND. Another example is matrix multiplication over the set of integers
modulo a prime p, a set that forms a finite field under addition and multiplication modulo p.
(See Problem 6.3.)

In the next section we describe Strassen's algorithm, a straight-line program realizable by a
logarithmic-depth circuit of size $O(n^{\log_2 7})$. This is not the final word on matrix multiplication,
however. Winograd and Coppersmith [81] have improved the bound to $O(n^{2.38})$. Despite
this progress, the smallest asymptotic bound on matrix multiplication remains unknown.

Since later in this chapter we design algorithms that make use of matrix multiplication,
it behooves us to make the following definition concerning the number of ring operations to
multiply two $n \times n$ matrices over a ring $\mathcal{R}$.

DEFINITION 6.3.1 Let $K \ge 1$. Then $M_{\rm matrix}(n, K)$ is the size of the smallest circuit of depth
$K \log_2 n$ over a commutative ring $\mathcal{R}$ for the multiplication of two $n \times n$ matrices.

The following assumptions on the rate of growth of $M_{\rm matrix}(n, K)$ with n make subsequent
analysis easier. They are satisfied by Strassen's algorithm.

ASSUMPTION 6.3.1 We assume that for all c satisfying $0 \le c \le 1$ and $n \ge 1$,

$$M_{\rm matrix}(cn, K) \le c^2 M_{\rm matrix}(n, K)$$

ASSUMPTION 6.3.2 We assume there exists an integer $n_0 > 0$ such that, for $n \ge n_0$,

$$2n^2 \le M_{\rm matrix}(n, K)$$

6.3.1 Strassen's Algorithm 

Strassen [319] has developed a fast algorithm for multiplying two square matrices over a
commutative ring $\mathcal{R}$. This algorithm makes use of the additive inverse of ring elements to reduce
the total number of operations performed.

Let n be even. Given two $n \times n$ matrices, A and B, we write them and their product C
as $2 \times 2$ matrices whose components are $(n/2) \times (n/2)$ matrices:

$$C = \begin{bmatrix} u & v \\ w & x \end{bmatrix} = A \times B = \begin{bmatrix} a & b \\ c & d \end{bmatrix} \times \begin{bmatrix} e & f \\ g & h \end{bmatrix}$$

Using the standard algorithm, we can form C with eight multiplications and four additions
of $(n/2) \times (n/2)$ matrices. Strassen's algorithm exchanges one of these multiplications for
10 such additions. Since one multiplication of two $(n/2) \times (n/2)$ matrices is much more
costly than an addition of two such matrices, a large reduction in the number of operations is
obtained. We now derive Strassen's algorithm.

Let D be the $4 \times 4$ matrix shown below whose entries are $(n/2) \times (n/2)$ matrices.
(Thus, D is a $2n \times 2n$ matrix.)

$$D = \begin{bmatrix} a & b & 0 & 0 \\ c & d & 0 & 0 \\ 0 & 0 & a & b \\ 0 & 0 & c & d \end{bmatrix}$$

The entries u, v, w, and x of the product $A \times B$ can also be produced by the following
matrix-vector product:

$$\begin{bmatrix} u \\ w \\ v \\ x \end{bmatrix} = D \times \begin{bmatrix} e \\ g \\ f \\ h \end{bmatrix}$$



We now write D as a sum of seven matrices as shown in Fig. 6.2; that is,

$$D = A_1 + A_2 + A_3 + A_4 + A_5 + A_6 + A_7$$

Let $P_1, P_2, \ldots, P_7$ be the products of the $(n/2) \times (n/2)$ matrices

$$\begin{array}{ll} P_1 = (a + d) \times (e + h) & P_5 = (a + b) \times h \\ P_2 = (c + d) \times e & P_6 = (-a + c) \times (e + f) \\ P_3 = a \times (f - h) & P_7 = (b - d) \times (g + h) \\ P_4 = d \times (-e + g) & \end{array}$$

Figure 6.2 The decomposition of the $4 \times 4$ matrix D as the sum of seven $4 \times 4$ matrices.

Then the product of the vector $[e, g, f, h]^T$ with D is the following sum of seven column
vectors.

$$\begin{bmatrix} u \\ w \\ v \\ x \end{bmatrix} = \begin{bmatrix} P_1 \\ 0 \\ 0 \\ P_1 \end{bmatrix} + \begin{bmatrix} 0 \\ P_2 \\ 0 \\ -P_2 \end{bmatrix} + \begin{bmatrix} 0 \\ 0 \\ P_3 \\ P_3 \end{bmatrix} + \begin{bmatrix} P_4 \\ P_4 \\ 0 \\ 0 \end{bmatrix} + \begin{bmatrix} -P_5 \\ 0 \\ P_5 \\ 0 \end{bmatrix} + \begin{bmatrix} 0 \\ 0 \\ 0 \\ P_6 \end{bmatrix} + \begin{bmatrix} P_7 \\ 0 \\ 0 \\ 0 \end{bmatrix}$$

It follows that u, v, w, and x are given by the following equations:

$$\begin{array}{ll} u = P_1 + P_4 - P_5 + P_7 & v = P_3 + P_5 \\ w = P_2 + P_4 & x = P_1 - P_2 + P_3 + P_6 \end{array}$$

Associativity and commutativity under addition and distributivity of multiplication over ad- 
dition are used to obtain this result. In particular, commutativity of the ring multiplication 
operator is not assumed. This is important because it allows this algorithm to be used when 
the entries in the original 2x2 matrices are themselves matrices, since matrix multiplication 
is not commutative. 
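The seven products and four output equations translate directly into a recursive procedure. The following is a minimal sketch, assuming n is a power of 2 and that recursion continues down to $1 \times 1$ blocks; the helper names (`add`, `sub`, `split`, `strassen`, `standard`) are ours. Each level performs 10 additions or subtractions before the products and 8 after, 18 in all.

```python
def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def split(M):
    """Split an n x n matrix (n even) into its four (n/2) x (n/2) quadrants."""
    h = len(M) // 2
    return ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
            [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])


def strassen(A, B):
    """Multiply two n x n matrices, n a power of 2, with seven recursive products."""
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    a, b, c, d = split(A)
    e, f, g, h = split(B)
    # Operand order is preserved in every product, so the block entries
    # are never assumed to commute.
    p1 = strassen(add(a, d), add(e, h))
    p2 = strassen(add(c, d), e)
    p3 = strassen(a, sub(f, h))
    p4 = strassen(d, sub(g, e))
    p5 = strassen(add(a, b), h)
    p6 = strassen(sub(c, a), add(e, f))
    p7 = strassen(sub(b, d), add(g, h))
    u = add(sub(add(p1, p4), p5), p7)
    v = add(p3, p5)
    w = add(p2, p4)
    x = add(sub(add(p1, p3), p2), p6)
    return [ru + rv for ru, rv in zip(u, v)] + [rw + rx for rw, rx in zip(w, x)]


def standard(A, B):
    """Standard algorithm, used here only as a reference for comparison."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*B)] for row in A]


A = [[(3 * i + j) % 5 - 2 for j in range(4)] for i in range(4)]
B = [[(i * i + 2 * j) % 7 - 3 for j in range(4)] for i in range(4)]
print(strassen(A, B) == standard(A, B))  # True
```

In practice the recursion would stop at a larger block size and hand off to the standard algorithm; recursing to $1 \times 1$ blocks keeps the sketch short.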

Thus, an algorithm exists to form the product of two $n \times n$ matrices with seven
multiplications of $(n/2) \times (n/2)$ matrices and 18 additions or subtractions of such matrices. Let
$n = 2^k$ and M(k) be the number of operations over the ring $\mathcal{R}$ used by this algorithm to
multiply $n \times n$ matrices. Then, M(k) satisfies

$$M(k) = 7M(k-1) + 18\,(2^{k-1})^2 = 7M(k-1) + (18)4^{k-1}$$

If the standard algorithm is used to multiply $2 \times 2$ matrices, $M(1) = 12$ and the recurrence
has the following solution:

$$M(k) = (36/7)7^k - (18/3)4^k$$

The depth (number of operations on the longest path), D(k), of this straight-line algorithm
for the product of two $n \times n$ matrices when $n = 2^k$ satisfies the following bound:

$$D(k) = D(k-1) + 3$$

because one level of addition or subtraction is used before products are formed and one or two
levels are used after they are formed. Since $D(1) = 2$ if the standard algorithm is used to
multiply $2 \times 2$ matrices, $D(k) = 3k - 1 = 3 \log n - 1$.
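As a quick numerical check on these formulas, the sketch below unrolls both recurrences and compares them with the closed forms $M(k) = (36/7)7^k - (18/3)4^k$ and $D(k) = 3k - 1$. The function names are ours; the base cases $M(1) = 12$ and $D(1) = 2$ are those given in the text.

```python
def ops_recurrence(k):
    """M(k) = 7 M(k-1) + 18 * 4^(k-1), with M(1) = 12."""
    return 12 if k == 1 else 7 * ops_recurrence(k - 1) + 18 * 4 ** (k - 1)


def ops_closed_form(k):
    """(36/7) 7^k - (18/3) 4^k, written with integer arithmetic."""
    return 36 * 7 ** (k - 1) - 6 * 4 ** k


def depth_recurrence(k):
    """D(k) = D(k-1) + 3, with D(1) = 2."""
    return 2 if k == 1 else depth_recurrence(k - 1) + 3


for k in range(1, 11):
    assert ops_recurrence(k) == ops_closed_form(k)
    assert depth_recurrence(k) == 3 * k - 1

print(ops_recurrence(2), ops_recurrence(3))  # 156 1380
```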

These size and depth bounds can be improved to those in the following theorem by using
the standard matrix multiplication algorithm on small matrices. (See Problem 6.8.)

THEOREM 6.3.1 The matrix multiplication function for $n \times n$ matrices over a commutative ring
$\mathcal{R}$, $f^{(n)}_{A \times B}$, has circuit size and depth satisfying the following bounds over the basis $\Omega$ containing
addition, multiplication, and additive inverse over $\mathcal{R}$:

$$C_\Omega\left(f^{(n)}_{A \times B}\right) = O(n^{\log_2 7})$$
$$D_\Omega\left(f^{(n)}_{A \times B}\right) = O(\log n)$$

We emphasize again that subtraction plays a central role in Strassen's algorithm. Without
it we show in Section 10.4 that the standard algorithm is nearly best possible.

Strassen's algorithm is practical for sufficiently large matrices, say with n > 64. It can 
also be used to multiply Boolean matrices even though the addition operator (OR) and the 
multiplication operator (AND) over the set B do not constitute a ring. (See Problem 6.9.) 

6.4 Transitive Closure 

The edges of a directed graph $G = (V, E)$, $n = |V|$, specify paths of length 1 between pairs of
vertices. (See Fig. 6.3.) This information is captured by the Boolean $n \times n$ adjacency matrix
$A = [a_{i,j}]$, $1 \le i, j \le n$, where $a_{i,j}$ is 1 if there is an edge from vertex i to vertex j in E and
0 otherwise. (The adjacency matrix for the graph in Fig. 6.3 is given after Lemma 6.4.1.) Our
goal is to compute a matrix $A^*$ whose i, j entry $a^*_{i,j}$ has value 1 if there is a path of length 0
or more between vertices i and j and value 0 otherwise. $A^*$ is called the transitive closure
of the matrix A. The transitive closure function $f^{(n)}_{A^*} : \mathcal{B}^{n^2} \mapsto \mathcal{B}^{n^2}$ maps an arbitrary $n \times n$
Boolean matrix A onto its $n \times n$ transitive closure matrix; that is, $f^{(n)}_{A^*}(A) = A^*$. In this
section we add and multiply Boolean matrices over the set $\mathcal{B}$ using OR as the element addition
operation and AND as the element multiplication operation. (Note that $(\mathcal{B}, \vee, \wedge, 0, 1)$ is not
a ring; it satisfies all the rules for a ring except for the condition that each element of $\mathcal{B}$ have
an (additive) inverse under $\vee$.)

To compute $A^*$ we use the following facts: a) the entry in the rth row and sth column
of the Boolean matrix product $A^2 = A \times A$ is 1 if there is a path containing two edges from
vertex r to vertex s and 0 otherwise (which follows from the definition of Boolean matrix
multiplication given in Section 6.3), and b) the entry in the rth row and sth column of
$A^k = A^{k-1} \times A$ is 1 if there is a path containing k edges from vertex r to vertex s and 0
otherwise, as the reader is asked to show. (See Problem 6.11.)

LEMMA 6.4.1 Let A be the Boolean adjacency matrix for a directed graph and let $A^k$ be the kth
power of A. Then the following identity holds for $k \ge 1$, where + denotes the addition (OR) of
Boolean matrices:

$$(I + A)^k = I + A + \cdots + A^k \qquad (6.2)$$

Proof The proof is by induction. The base case is $k = 1$, for which the identity holds.
Assume that it holds for $k \le K - 1$. We show that it holds for $k = K$. Since $(I + A)^{K-1} =
I + A + \cdots + A^{K-1}$, multiply both sides by $I + A$:

$$(I + A)^K = (I + A) \times (I + A)^{K-1}$$
$$= (I + A) \times (I + A + \cdots + A^{K-1})$$
$$= I + (A + A) + \cdots + (A^{K-1} + A^{K-1}) + A^K$$

However, since $A^j$ is a Boolean matrix, $A^j + A^j = A^j$ for all j and the result follows. ■

Figure 6.3 A graph that illustrates transitive closure.

The adjacency matrix A of the graph in Fig. 6.3 is given below along with its powers up to
the fifth power. Note that every non-zero entry appearing in $A^5$ appears in at least one of the
other matrices. The reason for this fact is explained in the proof of Lemma 6.4.2.



10 110 
10 
10 
10 
110 1 



1111 
10 
10 
10 
10 110 



LEMMA 6.4.2 If there is a path between pairs of vertices in the directed graph $G = (V, E)$,
$n = |V|$, there is a path of length at most $n - 1$.

Proof We suppose that the shortest path between vertices i and j in V has length $k \ge n$.
Such a path has $k + 1$ vertices. Because $k + 1 \ge n + 1$, some vertex is repeated more than
once. (This is an example of the pigeonhole principle.) Consider the subpath defined by the
edges between the first and last instance of this repeated vertex. Since it constitutes a loop,
it can be removed to produce a shorter path between vertices i and j. This contradicts the
hypothesis that the shortest path has length n or more. Thus, the shortest path has length
at most $n - 1$. ■

Because the shortest path has length at most $n - 1$, any non-zero entries in $A^k$, $k \ge n$, are
also found in one of the matrices $A^j$, $j \le n - 1$. Since the identity matrix I is the adjacency
matrix for the graph that has paths of length zero between two vertices, the transitive closure,
which includes such paths, is equal to:

$$A^* = I + A + A^2 + \cdots + A^{n-1} = (I + A)^{n-1}$$

It also follows that $A^* = (I + A)^k$ for all $k \ge n - 1$, which leads to the following result.



THEOREM 6.4.1 Over the basis $\Omega$ = {AND, OR} the transitive closure function, $f^{(n)}_{A^*}$, has circuit
size and depth satisfying the following bounds (that is, a circuit of this size and depth can be
constructed with AND and OR gates for it):

$$C_\Omega\left(f^{(n)}_{A^*}\right) \le M_{\rm matrix}(cn, K) \lceil \log_2 n \rceil$$
$$D_\Omega\left(f^{(n)}_{A^*}\right) \le K (\log n) \lceil \log_2 n \rceil$$

Proof Let $k = 2^p$ be the smallest power of 2 such that $k \ge n - 1$. Then, $p = \lceil \log_2 (n - 1) \rceil$.
Since $A^* = (I + A)^k$, it can be computed with a circuit that squares the matrix $I + A$ p
times. Each squaring can be done with a circuit for the standard matrix multiplication algorithm
described in (6.1) using $M_{\rm matrix}(cn, K) = O(n^3)$ operations and depth $\lceil \log_2 2n \rceil$.
The desired result follows. ■
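The repeated-squaring construction in this proof can be sketched as follows. This is an illustration only: `bool_mul` and `closure_by_squaring` are hypothetical names, matrices are lists of rows with entries 0 and 1, and the Boolean product is computed by the standard algorithm with OR as addition and AND as multiplication.

```python
def bool_mul(A, B):
    """Boolean matrix product: OR plays the role of +, AND the role of *."""
    n = len(A)
    return [[int(any(A[i][k] and B[k][j] for k in range(n))) for j in range(n)]
            for i in range(n)]


def closure_by_squaring(A):
    """Return A* = (I + A)^k for the smallest power of two k >= n - 1."""
    n = len(A)
    # Form I + A by forcing 1s onto the main diagonal.
    M = [[1 if i == j else A[i][j] for j in range(n)] for i in range(n)]
    k = 1
    while k < n - 1:
        M = bool_mul(M, M)  # (I + A)^k becomes (I + A)^(2k)
        k *= 2
    return M


# A directed path 1 -> 2 -> 3 -> 4.
A = [[0, 1, 0, 0],
     [0, 0, 1, 0],
     [0, 0, 0, 1],
     [0, 0, 0, 0]]
print(closure_by_squaring(A))
# [[1, 1, 1, 1], [0, 1, 1, 1], [0, 0, 1, 1], [0, 0, 0, 1]]
```

For n = 4 the loop squares twice, producing $(I + A)^4$, which equals $A^*$ because $4 \ge n - 1$.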

The above statement says that the transitive closure function on n x n matrices has circuit 
size and depth at most a factor O(logn) times that of matrix multiplication. We now show 
that Boolean matrix multiplication is a subfunction of the transitive closure function, which 
implies that the former has a circuit size and depth no larger than the latter. We subsequently 
show that the size bound can be improved to a constant multiple of the size bound for matrix 
multiplication. Thus the transitive closure and Boolean matrix multiplication functions have 
comparable size. 



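THEOREM 6.4.2 The $n \times n$ matrix multiplication function $f^{(n)}_{A \times B} : \mathcal{R}^{2n^2} \mapsto \mathcal{R}^{n^2}$ for Boolean
matrices is a subfunction of the transitive closure function $f^{(3n)}_{A^*} : \mathcal{R}^{9n^2} \mapsto \mathcal{R}^{9n^2}$.

Proof Observe that the following relationship holds for $n \times n$ matrices A and B, since the
third and higher powers of the $3n \times 3n$ matrix on the left are 0.

$$\begin{bmatrix} 0 & A & 0 \\ 0 & 0 & B \\ 0 & 0 & 0 \end{bmatrix}^* = \begin{bmatrix} I & A & AB \\ 0 & I & B \\ 0 & 0 & I \end{bmatrix}$$

It follows that the product AB of $n \times n$ matrices is a subfunction of the transitive closure
function on a $3n \times 3n$ matrix. ■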
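The embedding in this proof can be exercised directly. The sketch below builds the $3n \times 3n$ matrix with A and B above the diagonal, computes its transitive closure, and reads AB from the upper right block. The names `embed`, `closure`, and `product_via_closure` are ours, and Warshall's algorithm is used here as a stand-in closure routine, not as the circuit construction of the text.

```python
def closure(A):
    """Reflexive transitive closure of a 0/1 matrix (Warshall's algorithm)."""
    n = len(A)
    R = [[1 if i == j else A[i][j] for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(n):
            if R[i][k]:
                for j in range(n):
                    if R[k][j]:
                        R[i][j] = 1
    return R


def embed(A, B):
    """Place A in block (1,2) and B in block (2,3) of a 3n x 3n zero matrix."""
    n = len(A)
    Z = [[0] * (3 * n) for _ in range(3 * n)]
    for i in range(n):
        for j in range(n):
            Z[i][n + j] = A[i][j]
            Z[n + i][2 * n + j] = B[i][j]
    return Z


def product_via_closure(A, B):
    """Boolean product AB, read off block (1,3) of the closure."""
    n = len(A)
    C = closure(embed(A, B))
    return [row[2 * n:] for row in C[:n]]


A = [[1, 0], [1, 1]]
B = [[0, 1], [1, 0]]
print(product_via_closure(A, B))  # [[0, 1], [1, 1]]
```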



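COROLLARY 6.4.1 It follows that

$$C_\Omega\left(f^{(n)}_{A \times B}\right) \le C_\Omega\left(f^{(3n)}_{A^*}\right)$$
$$D_\Omega\left(f^{(n)}_{A \times B}\right) \le D_\Omega\left(f^{(3n)}_{A^*}\right)$$

over the basis $\Omega$ = {AND, OR}.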



Not only can a Boolean matrix multiplication algorithm be devised from one for transitive
closure, but the reverse is also true, as we show. Let n be a power of 2 and divide an $n \times n$
matrix A into four $(n/2) \times (n/2)$ matrices:

$$A = \begin{bmatrix} U & V \\ W & X \end{bmatrix} \qquad (6.3)$$

Compute $X^*$ recursively and use it to form $Y = U + VX^*W$ by performing two multiplications
of $(n/2) \times (n/2)$ matrices and one addition of such matrices. Recursively form $Y^*$ and
then assemble the matrix B shown below with four further multiplications and one addition
of $(n/2) \times (n/2)$ matrices.

$$B = \begin{bmatrix} Y^* & Y^*VX^* \\ X^*WY^* & X^* + X^*WY^*VX^* \end{bmatrix} \qquad (6.4)$$

We now show that $B = A^*$.

THEOREM 6.4.3 Under Assumptions 6.3.1 and 6.3.2, a circuit of size $O(M_{\rm matrix}(n, K))$ and
depth $O(n)$ exists to form the transitive closure of $n \times n$ matrices.

Proof We assume that n is a power of 2 and use the representation for the matrix A given
in (6.3). If n is not a power of 2, we augment the matrix A by embedding it in a larger
matrix in which all the new entries are 0 except for the new diagonal entries, which are 1.
Given that $4M(n) \le M(2n)$, the bound applies.

We begin by showing that $B = A^*$. Let $F \subset V$ and $S \subset V$ be the first and second
sets of n/2 vertices, respectively, corresponding to the first and second halves of the rows
and columns of the matrix A. Then, $F \cup S = V$ and $F \cap S = \emptyset$. Observe that $X^*$ is
the adjacency matrix for those paths originating on and terminating with vertices in S and
visiting no other vertices. Similarly, $Y = U + VX^*W$ is the adjacency matrix for those
paths consisting of an edge from a vertex in F to a vertex in F or paths of length more
than 1 consisting of an edge from vertices in F to vertices in S, a path of length 0 or more
within vertices in S, and an edge from vertices in S to vertices in F. It follows that $Y^*$ is
the adjacency matrix for all paths between vertices in F that may visit any vertices in V. A
similar line of reasoning demonstrates that the other entries of $A^*$ are correct.

The size of a circuit realizing this algorithm, T(n), satisfies

$$T(n) = 2T(n/2) + 6M_{\rm matrix}(n/2, K) + 2(n/2)^2$$

because the above algorithm (see Fig. 6.4) uses two circuits for transitive closure on $(n/2) \times
(n/2)$ matrices, six circuits for multiplying, and two for adding two such matrices.

Because we assume that $n^2 \le M_{\rm matrix}(n, K)$, it follows that $T(n) \le 2T(n/2) +
8M_{\rm matrix}(n/2, K)$. Let $T(m) \le cM_{\rm matrix}(m, K)$ for $m \le n/2$ be the inductive
hypothesis. Then we have the inequalities

$$T(n) \le (2c + 8)M_{\rm matrix}(n/2, K) \le (c/2 + 2)M_{\rm matrix}(n, K)$$

which follow from $M_{\rm matrix}(n/2, K) \le M_{\rm matrix}(n, K)/4$ (see Assumption 6.3.1). Because
$(c/2 + 2) \le c$ for $c \ge 4$, for $c = 4$ we have the desired bound on circuit size.

The depth D(n) of the above circuit satisfies $D(n) = 2D(n/2) + 6K \log_2 n$, from
which we conclude that $D(n) = O(n)$. ■
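The recursive construction of (6.3) and (6.4) can be sketched as below, assuming n is a power of 2 and 0/1 entries. The names `b_add`, `b_mul`, `closure_recursive`, and `closure_warshall` are ours; Warshall's algorithm appears only as an independent check. Each level uses two recursive closures, six Boolean products, and two Boolean additions, matching the recurrence for T(n).

```python
def b_add(A, B):
    return [[a | b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def b_mul(A, B):
    cols = list(zip(*B))
    return [[int(any(a and b for a, b in zip(row, col))) for col in cols] for row in A]


def closure_recursive(A):
    """Transitive closure via equation (6.4); len(A) must be a power of 2."""
    n = len(A)
    if n == 1:
        return [[1]]  # a path of length 0 always exists
    h = n // 2
    U = [r[:h] for r in A[:h]]
    V = [r[h:] for r in A[:h]]
    W = [r[:h] for r in A[h:]]
    X = [r[h:] for r in A[h:]]
    Xs = closure_recursive(X)                  # X*
    VXs = b_mul(V, Xs)                         # V X*
    Y = b_add(U, b_mul(VXs, W))                # Y = U + V X* W
    Ys = closure_recursive(Y)                  # Y*
    XsW = b_mul(Xs, W)                         # X* W
    top_right = b_mul(Ys, VXs)                 # Y* V X*
    bottom_left = b_mul(XsW, Ys)               # X* W Y*
    bottom_right = b_add(Xs, b_mul(bottom_left, VXs))  # X* + X* W Y* V X*
    return ([ry + rt for ry, rt in zip(Ys, top_right)] +
            [rl + rr for rl, rr in zip(bottom_left, bottom_right)])


def closure_warshall(A):
    """Reference implementation used only to check the recursive version."""
    n = len(A)
    R = [[1 if i == j else A[i][j] for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                R[i][j] = R[i][j] | (R[i][k] & R[k][j])
    return R


A = [[0, 0, 1, 0],
     [0, 0, 0, 0],
     [0, 1, 0, 0],
     [0, 0, 0, 0]]
print(closure_recursive(A))
# [[1, 1, 1, 0], [0, 1, 0, 0], [0, 1, 1, 0], [0, 0, 0, 1]]
```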

Figure 6.4 A circuit for the transitive closure of a Boolean matrix based on the construction of
equation (6.4).

A semiring $(S, +, \cdot, 0, 1)$ is a set S, two operations + and $\cdot$ and elements $0, 1 \in S$ with
the following properties:

a) S is closed under + and $\cdot$;

b) + and $\cdot$ are associative;

c) for all $a \in S$, $a + 0 = 0 + a = a$;

d) for all $a \in S$, $a \cdot 1 = 1 \cdot a = a$;

e) + is commutative and idempotent; i.e. $a + a = a$;

f) $\cdot$ distributes over +; i.e. for all $a, b, c \in S$, $a \cdot (b + c) = a \cdot b + a \cdot c$
and $(b + c) \cdot a = b \cdot a + c \cdot a$.

The above definitions and results generalize to matrices over semirings. To show this, it suf- 
fices to observe that the properties used to derive these results are just these properties. (See 
Problem 6.12.) 



6.5 Matrix Inversion 

The inverse of a non-singular $n \times n$ matrix M defined over a field $\mathcal{R}$ is another matrix $M^{-1}$
whose product with M is the $n \times n$ identity matrix I; that is,

$$MM^{-1} = M^{-1}M = I$$

Given a linear system of n equations in the column vector x of n unknowns defined by
the non-singular $n \times n$ coefficient matrix M and the vector b, namely,

$$Mx = b \qquad (6.5)$$

the solution x can be obtained through a matrix-vector multiplication with $M^{-1}$:

$$x = M^{-1}b$$

In this section we present two algorithms for matrix inversion. Such algorithms compute
the (partial) matrix inverse function $f^{(n)}_{A^{-1}} : \mathcal{R}^{n^2} \mapsto \mathcal{R}^{n^2}$ that maps non-singular $n \times n$
matrices over a field $\mathcal{R}$ onto their inverses. The first result, Theorem 6.5.4, demonstrates that
$C_\Omega\left(f^{(n)}_{A^{-1}}\right) = \Theta(M_{\rm matrix}(n, K))$ with a circuit whose depth is more than linear in n. The
second, Theorem 6.5.6, demonstrates that $D_\Omega\left(f^{(n)}_{A^{-1}}\right) = O(\log^2 n)$ with a circuit whose size
is $O(nM_{\rm matrix}(n, K))$.

Before describing the two matrix inversion algorithms, we present a result demonstrating
that matrix multiplication of $n \times n$ matrices is no harder than inverting a $3n \times 3n$ matrix; the
function defining the former task is a subfunction of the function defining the latter task.



LEMMA 6.5.1 The matrix inverse function $f^{(3n)}_{A^{-1}}$ contains as a subfunction the function $f^{(n)}_{A \times B} :
\mathcal{R}^{2n^2} \mapsto \mathcal{R}^{n^2}$ that maps two matrices over $\mathcal{R}$ to their product.

Proof The proof follows by writing a $3n \times 3n$ matrix as a $3 \times 3$ matrix of $n \times n$ matrices
and then specializing the entries to be the identity matrix I, the zero matrix 0, or matrices
A and B:

$$\begin{bmatrix} I & A & 0 \\ 0 & I & B \\ 0 & 0 & I \end{bmatrix}^{-1} = \begin{bmatrix} I & -A & AB \\ 0 & I & -B \\ 0 & 0 & I \end{bmatrix}$$

This identity is established by showing that the product of these two matrices is the identity
matrix. ■

6.5.1 Symmetric Positive Definite Matrices 

Our first algorithm to invert a non-singular $n \times n$ matrix M has a circuit size linear in
$M_{\rm matrix}(n, K)$, which, in light of Lemma 6.5.1, is optimal to within a constant multiplicative
factor. This algorithm makes use of symmetric positive definite matrices, the Schur complement,
and $LDL^T$ factorization, terms defined below. This algorithm has depth $O(n \log n)$.
The second algorithm, Csanky's algorithm, has circuit depth $O(\log^2 n)$, which is smaller,
but circuit size $O(nM_{\rm matrix}(n, K))$, which is larger. Symmetric positive definite matrices are
defined below.

DEFINITION 6.5.1 A matrix M is positive definite if for all non-zero vectors x the following
condition holds:

$$x^T M x = \sum_{1 \le i,j \le n} x_i m_{i,j} x_j > 0 \qquad (6.6)$$

A matrix is symmetric positive definite (SPD) if it is both symmetric and positive definite.

We now show that an algorithm to invert SPD matrices can be used to invert arbitrary
non-singular matrices by adding a circuit to multiply matrices.

LEMMA 6.5.2 If M is a non-singular $n \times n$ matrix, then the matrix $P = M^T M$ is symmetric
positive definite. M can be inverted by inverting P and then multiplying $P^{-1}$ by $M^T$. Let
$f^{(n)}_{\rm SPD\_inverse} : \mathcal{R}^{n^2} \mapsto \mathcal{R}^{n^2}$ be the inverse function for $n \times n$ SPD matrices over the field $\mathcal{R}$.
Then the size and depth of $f^{(n)}_{A^{-1}}$ over $\mathcal{R}$ satisfy the following bounds:

$$C\left(f^{(n)}_{A^{-1}}\right) \le C\left(f^{(n)}_{\rm SPD\_inverse}\right) + M_{\rm matrix}(n, K)$$
$$D\left(f^{(n)}_{A^{-1}}\right) \le D\left(f^{(n)}_{\rm SPD\_inverse}\right) + O(\log n)$$

Proof To show that P is symmetric we note that $(M^T M)^T = M^T M$. To show that it is
positive definite, we observe that

$$x^T P x = x^T M^T M x = (Mx)^T Mx = \sum_{i=1}^{n} \left( \sum_{j=1}^{n} m_{i,j} x_j \right)^2$$

which is positive unless the product Mx is identically zero for the non-zero vector x. But
this cannot be true if M is non-singular. Thus, P is symmetric and positive definite.

To invert M, invert P to produce $M^{-1}(M^T)^{-1}$. If we multiply this product on the
right by $M^T$, the result is the inverse $M^{-1}$. ■
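The reduction in Lemma 6.5.2 can be sketched with exact rational arithmetic. This is only an illustration: `inverse_via_spd` is our name, and `gauss_jordan_inverse` is a generic stand-in for the SPD inversion algorithm developed later in the section, not that algorithm itself.

```python
from fractions import Fraction


def transpose(M):
    return [list(col) for col in zip(*M)]


def mat_mul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def gauss_jordan_inverse(M):
    """Generic exact inverse (stand-in for an SPD-specific inverter)."""
    n = len(M)
    aug = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(M)]
    for col in range(n):
        # Find a row with a non-zero pivot; StopIteration means M is singular.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [v / pv for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def inverse_via_spd(M):
    """Invert non-singular M as P^{-1} M^T, where P = M^T M is SPD."""
    Mt = transpose(M)
    P = mat_mul(Mt, M)
    return mat_mul(gauss_jordan_inverse(P), Mt)


M = [[1, 1], [-1, 1]]
print(inverse_via_spd(M))
# [[Fraction(1, 2), Fraction(-1, 2)], [Fraction(1, 2), Fraction(1, 2)]]
```

The one extra matrix product, $P^{-1} M^T$, is the source of the additive $M_{\rm matrix}(n, K)$ term in the size bound.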

6.5.2 Schur Factorization 

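We now describe Schur factorization. Represent an $n \times n$ matrix M as the $2 \times 2$ matrix

$$M = \begin{bmatrix} M_{1,1} & M_{1,2} \\ M_{2,1} & M_{2,2} \end{bmatrix} \qquad (6.7)$$

where $M_{1,1}$, $M_{1,2}$, $M_{2,1}$, and $M_{2,2}$ are $k \times k$, $k \times (n-k)$, $(n-k) \times k$, and $(n-k) \times (n-k)$ matrices,
$1 \le k \le n - 1$. Let $M_{1,1}$ be invertible. Then by straightforward algebraic manipulation M
can be factored as

$$M = \begin{bmatrix} I & 0 \\ M_{2,1}M_{1,1}^{-1} & I \end{bmatrix} \begin{bmatrix} M_{1,1} & 0 \\ 0 & S \end{bmatrix} \begin{bmatrix} I & M_{1,1}^{-1}M_{1,2} \\ 0 & I \end{bmatrix} \qquad (6.8)$$

Here I and 0 denote identity and zero matrices (all entries are zero) of a size that conforms
to the size of other submatrices of those matrices in which they are found. This is the Schur
factorization. Also,

$$S = M_{2,2} - M_{2,1}M_{1,1}^{-1}M_{1,2}$$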

is the Schur complement of M. To show that M has this factorization, it suffices to carry out
the product of the above three matrices.

The first and last matrix in this product are invertible. If S is also invertible, the middle
matrix is invertible, as is the matrix M itself. The inverse of M, $M^{-1}$, is given by the product

$$M^{-1} = \begin{bmatrix} I & -M_{1,1}^{-1}M_{1,2} \\ 0 & I \end{bmatrix} \begin{bmatrix} M_{1,1}^{-1} & 0 \\ 0 & S^{-1} \end{bmatrix} \begin{bmatrix} I & 0 \\ -M_{2,1}M_{1,1}^{-1} & I \end{bmatrix} \qquad (6.9)$$

This follows from three observations: a) the inverse of a product is the product of the inverses
in reverse order (see Lemma 6.2.1), b) the inverse of a $2 \times 2$ upper (lower) triangular matrix
is the matrix with the off-diagonal term negated, and c) the inverse of a $2 \times 2$ diagonal matrix
is a diagonal matrix in which the ith diagonal element is the multiplicative inverse of the ith
diagonal element of the original matrix. (See Problem 6.13 for the latter two results.)
The following fact is useful in inverting SPD matrices. 

LEMMA 6.5.3 If M is an $n \times n$ SPD matrix, its Schur complement is also SPD.

Proof Represent M as shown in (6.7). In (6.6) let $x = u \cdot v$; that is, let x be the
concatenation of the two column vectors. Then

$$x^T M x = [u^T, v^T] \begin{bmatrix} M_{1,1}u + M_{1,2}v \\ M_{2,1}u + M_{2,2}v \end{bmatrix} = u^T M_{1,1} u + u^T M_{1,2} v + v^T M_{2,1} u + v^T M_{2,2} v$$

If we say that

$$u = -M_{1,1}^{-1} M_{1,2} v$$

and use the fact that $M_{1,2}^T = M_{2,1}$ and $(M_{1,1}^{-1})^T = (M_{1,1}^T)^{-1} = M_{1,1}^{-1}$, it is straightforward
to show that S is symmetric and

$$x^T M x = v^T S v$$

where S is the Schur complement of M. Thus, if M is SPD, so is its Schur complement. ■

6.5.3 Inversion of Triangular Matrices 

Let T be $n \times n$ lower triangular and non-singular. Without loss of generality, assume that
$n = 2^r$. (T can be extended to a $2^r \times 2^r$ matrix by placing it on the diagonal of a $2^r \times 2^r$
matrix along with a $(2^r - n) \times (2^r - n)$ identity matrix.) Represent T as a $2 \times 2$ matrix of
$n/2 \times n/2$ matrices:

$$T = \begin{bmatrix} T_{1,1} & 0 \\ T_{2,1} & T_{2,2} \end{bmatrix}$$

The inverse of T, which is lower triangular, is given below, as can be verified directly:

$$T^{-1} = \begin{bmatrix} T_{1,1}^{-1} & 0 \\ -T_{2,2}^{-1} T_{2,1} T_{1,1}^{-1} & T_{2,2}^{-1} \end{bmatrix}$$

Figure 6.5 A recursive circuit TRI_INV[n] for the inversion of a triangular matrix.

This representation for the inverse of T defines the recursive algorithm TRI_INV[n] in
Fig. 6.5. When n = 1 this algorithm requires one operation; on an $n \times n$ matrix it requires
two calls to TRI_INV[n/2] and two matrix multiplications. Let $f^{(n)}_{\rm tri\_inv} : \mathcal{R}^{(n^2+n)/2} \mapsto
\mathcal{R}^{(n^2+n)/2}$ be the function corresponding to the inversion of an $n \times n$ lower triangular matrix.
The algorithm TRI_INV[n] provides the following bounds on the size and depth of the
smallest circuit to compute $f^{(n)}_{\rm tri\_inv}$.
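A direct transcription of TRI_INV[n] is sketched below, assuming n is a power of 2 and exact rational entries. The function names `tri_inv` and `mat_mul` are ours. Each call makes two recursive calls and two matrix multiplications, as described above.

```python
from fractions import Fraction


def mat_mul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def tri_inv(T):
    """Invert a non-singular lower triangular matrix; len(T) must be a power of 2."""
    n = len(T)
    if n == 1:
        return [[1 / Fraction(T[0][0])]]  # one field operation
    h = n // 2
    T11 = [r[:h] for r in T[:h]]
    T21 = [r[:h] for r in T[h:]]
    T22 = [r[h:] for r in T[h:]]
    T11_inv = tri_inv(T11)
    T22_inv = tri_inv(T22)
    # Lower-left block: -T22^{-1} T21 T11^{-1}, formed with two multiplications.
    prod = mat_mul(mat_mul(T22_inv, T21), T11_inv)
    lower_left = [[-v for v in row] for row in prod]
    top = [row + [Fraction(0)] * (n - h) for row in T11_inv]
    bottom = [rl + rr for rl, rr in zip(lower_left, T22_inv)]
    return top + bottom


T = [[2, 0, 0, 0],
     [1, 1, 0, 0],
     [0, 3, 4, 0],
     [5, 0, 1, 2]]
T_inv = tri_inv(T)
identity = [[int(i == j) for j in range(4)] for i in range(4)]
print(mat_mul(T, T_inv) == identity)  # True
```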

THEOREM 6.5. 1 Let n be a power of 2. Then the matrix inversion function / tri inv for n X n 
lower triangular matrices satisfies the following bounds: 



C 



V J tri_inv J 



< Mr, 



,(n,K) 



D(fi;l im ) =0(log 2 n) 

Proof From Fig. 6.5 it is clear that the following circuit size and depth bounds hold if the
matrix multiplication algorithm has circuit size $M_{\mathrm{matrix}}(n, K)$ and depth $K \log_2 n$:

$$C\left(f^{(n)}_{\mathrm{tri\_inv}}\right) \le 2C\left(f^{(n/2)}_{\mathrm{tri\_inv}}\right) + 2M_{\mathrm{matrix}}(n/2, K)$$

$$D\left(f^{(n)}_{\mathrm{tri\_inv}}\right) \le D\left(f^{(n/2)}_{\mathrm{tri\_inv}}\right) + 2K \log n$$

The solution to the first inequality follows by induction from the fact that $M_{\mathrm{matrix}}(1, K) = 1$
and the assumption that $4M_{\mathrm{matrix}}(n/2, K) \le M_{\mathrm{matrix}}(n, K)$. The second inequality
follows from the observation that $d > 0$ can be chosen so that $d \log^2(n/2) + c \log n \le d \log^2 n$
for any $c > 0$ for $n$ sufficiently large. ■
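
The recursion behind TRI_INV[n] can be sketched in Python as follows. This is a minimal sequential sketch under stated assumptions, not the circuit itself: the helper names `mat_mul` and `tri_inv` are illustrative, exact rational arithmetic from the standard `fractions` module stands in for the field, and the matrix size is assumed to be a power of 2.

```python
from fractions import Fraction

def mat_mul(A, B):
    """Plain cubic-time product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def tri_inv(T):
    """Invert a non-singular lower triangular matrix whose size is a power of 2."""
    n = len(T)
    if n == 1:
        return [[Fraction(1) / T[0][0]]]          # one operation when n = 1
    h = n // 2
    T11 = [row[:h] for row in T[:h]]
    T21 = [row[:h] for row in T[h:]]
    T22 = [row[h:] for row in T[h:]]
    I11 = tri_inv(T11)                            # first call on an n/2 x n/2 block
    I22 = tri_inv(T22)                            # second call on an n/2 x n/2 block
    P = mat_mul(mat_mul(I22, T21), I11)           # two matrix multiplications
    top = [I11[i] + [Fraction(0)] * h for i in range(h)]
    bottom = [[-x for x in P[i]] + I22[i] for i in range(h)]
    return top + bottom
```

Multiplying the result by the original matrix should give the identity, which is a convenient check on the block formula for the inverse.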
6.5.4 $LDL^T$ Factorization of SPD Matrices

Now that we know that the Schur complement $S$ of $M$ is SPD when $M$ is SPD, we can show
that every SPD matrix $M$ has a factorization as the product $LDL^T$ of a unit lower triangular
matrix $L$ (each of its diagonal entries is the multiplicative unit of the field $\mathcal{R}$), a diagonal
matrix $D$, and the transpose of $L$.

THEOREM 6.5.2 Every n x n SPD matrix M has a factorization as the product M = LDL T , 
where L is a unit lower triangular matrix and D is a diagonal matrix. 

Proof The proof is by induction on $n$. For $n = 1$ the result is obvious because we can write
$[m_{1,1}] = [1][m_{1,1}][1]$. Assume that it holds for $n \le N - 1$. We show that it holds for
$n = N$.

Form the Schur factorization of the $N \times N$ matrix $M$. Since the $k \times k$ submatrix $M_{1,1}$
of $M$ as well as the $(n - k) \times (n - k)$ submatrix $S$ of $M$ are SPD, by the inductive hypothesis
they can be factored in the same fashion. Let

$$M_{1,1} = L_1 D_1 L_1^T, \qquad S = L_2 D_2 L_2^T$$

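Then the middle matrix on the right-hand side of equation (6.8) can be represented as

$$\begin{bmatrix} M_{1,1} & 0 \\ 0 & S \end{bmatrix} = \begin{bmatrix} L_1 & 0 \\ 0 & L_2 \end{bmatrix} \begin{bmatrix} D_1 & 0 \\ 0 & D_2 \end{bmatrix} \begin{bmatrix} L_1^T & 0 \\ 0 & L_2^T \end{bmatrix}$$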

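Substituting the above product for the middle matrix in (6.8) and multiplying the two left
and two right matrices gives the following representation for $M$:

$$M = \begin{bmatrix} L_1 & 0 \\ M_{2,1} M_{1,1}^{-1} L_1 & L_2 \end{bmatrix} \begin{bmatrix} D_1 & 0 \\ 0 & D_2 \end{bmatrix} \begin{bmatrix} L_1^T & L_1^T M_{1,1}^{-1} M_{1,2} \\ 0 & L_2^T \end{bmatrix} \qquad (6.10)$$

Since $M$ is symmetric, $M_{1,1}$ is symmetric, $M_{1,2} = M_{2,1}^T$, and

$$L_1^T M_{1,1}^{-1} M_{1,2} = L_1^T (M_{1,1}^{-1})^T M_{2,1}^T = (M_{2,1} M_{1,1}^{-1} L_1)^T$$

Thus, it suffices to compute $L_1$, $D_1$, $L_2$, $D_2$, and $M_{2,1} M_{1,1}^{-1} L_1$. ■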

When $n = 2^r$ and $k = n/2$, the proof of Theorem 6.5.2 describes a recursive procedure,
LDL^T[n], defined on $n \times n$ SPD matrices that produces their $LDL^T$ factorization. Figure 6.6
captures the steps involved. They are also described below.

• The $LDL^T$ factorization of the $n/2 \times n/2$ matrix $M_{1,1}$ is computed using the procedure
LDL^T[n/2] to produce the $n/2 \times n/2$ triangular and diagonal matrices $L_1$ and $D_1$,
respectively.

• The product $M_{2,1} M_{1,1}^{-1} L_1 = M_{2,1} (L_1^{-1})^T D_1^{-1}$ may be computed by inverting the
lower triangular matrix $L_1$ with the operation TRI_INV[n/2], computing the product
$M_{2,1} (L_1^{-1})^T$ using MULT[n/2], and multiplying the result with $D_1^{-1}$ using a procedure
SCALE[n/2] that inverts $D_1$ and multiplies it by a square matrix.

• $S = M_{2,2} - M_{2,1} M_{1,1}^{-1} M_{1,2}$ can be formed by multiplying $M_{2,1} (L_1^{-1})^T D_1^{-1}$ by the
transpose of $M_{2,1} (L_1^{-1})^T$ using MULT[n/2] and subtracting the result from $M_{2,2}$ by the
subtraction operator SUB[n/2].

• The $LDL^T$ factorization of the $n/2 \times n/2$ matrix $S$ is computed using the procedure
LDL^T[n/2] to produce the $n/2 \times n/2$ triangular and diagonal matrices $L_2$ and $D_2$,
respectively.

Figure 6.6 An algebraic circuit to produce the $LDL^T$ factorization of an SPD matrix.
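
The four steps above can be sketched in Python as follows. This is a hedged, sequential illustration rather than the algebraic circuit of Fig. 6.6: the function names (`mat_mul`, `transpose`, `split`, `tri_inv`, `ldlt`) are assumptions made for the example, $D$ is returned as a list of diagonal entries, exact `Fraction` arithmetic stands in for the field, and the input is assumed to be SPD with size a power of 2.

```python
from fractions import Fraction

def mat_mul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def split(M):
    """Return the blocks M11, M21, M22 of an even-sized square matrix."""
    h = len(M) // 2
    return ([r[:h] for r in M[:h]], [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])

def tri_inv(T):
    """Recursive inverse of a lower triangular matrix (TRI_INV)."""
    if len(T) == 1:
        return [[Fraction(1) / T[0][0]]]
    h = len(T) // 2
    T11, T21, T22 = split(T)
    I11, I22 = tri_inv(T11), tri_inv(T22)
    P = mat_mul(mat_mul(I22, T21), I11)
    return ([I11[i] + [Fraction(0)] * h for i in range(h)] +
            [[-x for x in P[i]] + I22[i] for i in range(h)])

def ldlt(M):
    """Return (L, D) with M = L diag(D) L^T for an SPD matrix M."""
    n = len(M)
    if n == 1:
        return [[Fraction(1)]], [Fraction(M[0][0])]
    h = n // 2
    M11, M21, M22 = split(M)
    L1, D1 = ldlt(M11)                                  # LDL^T[n/2] on M11
    X = mat_mul(M21, transpose(tri_inv(L1)))            # TRI_INV[n/2], MULT[n/2]
    Y = [[X[i][j] / D1[j] for j in range(h)] for i in range(h)]        # SCALE[n/2]
    YXt = mat_mul(Y, transpose(X))                      # MULT[n/2]
    S = [[M22[i][j] - YXt[i][j] for j in range(h)] for i in range(h)]  # SUB[n/2]
    L2, D2 = ldlt(S)                                    # LDL^T[n/2] on S
    L = ([L1[i] + [Fraction(0)] * h for i in range(h)] +
         [Y[i] + L2[i] for i in range(h)])
    return L, D1 + D2
```

Here `Y` is the lower-left block $M_{2,1} M_{1,1}^{-1} L_1$ of $L$, so the returned factors can be checked by multiplying $L$, the diagonal matrix built from `D`, and $L^T$ back together.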

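Let's now determine the size and depth of circuits to implement the algorithm for LDL^T[n].
Let $f^{(n)}_{LDL^T} : \mathcal{R}^{n^2} \mapsto \mathcal{R}^{(n^2+n)/2}$ be the function defined by the $LDL^T$ factorization of an $n \times n$
SPD matrix, $f^{(n)}_{\mathrm{tri\_inv}} : \mathcal{R}^{(n^2+n)/2} \mapsto \mathcal{R}^{(n^2+n)/2}$ be the inversion of an $n \times n$ lower triangular
matrix, $f^{(n)}_{\mathrm{scale}} : \mathcal{R}^{n^2+n} \mapsto \mathcal{R}^{n^2}$ be the computation of $N(D^{-1})$ for an $n \times n$ matrix $N$ and
a diagonal matrix $D$, $f^{(n)}_{\mathrm{mult}} : \mathcal{R}^{2n^2} \mapsto \mathcal{R}^{n^2}$ be the multiplication of two $n \times n$ matrices, and
$f^{(n)}_{\mathrm{sub}} : \mathcal{R}^{2n^2} \mapsto \mathcal{R}^{n^2}$ the subtraction of two $n \times n$ matrices. Since a transposition can be done
without any operators, the size and depth of the circuit for LDL^T[n] constructed above satisfy
the following inequalities:

$$C\left(f^{(n)}_{LDL^T}\right) \le C\left(f^{(n/2)}_{\mathrm{tri\_inv}}\right) + C\left(f^{(n/2)}_{\mathrm{scale}}\right) + 2C\left(f^{(n/2)}_{\mathrm{mult}}\right) + C\left(f^{(n/2)}_{\mathrm{sub}}\right) + 2C\left(f^{(n/2)}_{LDL^T}\right)$$

$$D\left(f^{(n)}_{LDL^T}\right) \le D\left(f^{(n/2)}_{\mathrm{tri\_inv}}\right) + D\left(f^{(n/2)}_{\mathrm{scale}}\right) + 2D\left(f^{(n/2)}_{\mathrm{mult}}\right) + D\left(f^{(n/2)}_{\mathrm{sub}}\right) + 2D\left(f^{(n/2)}_{LDL^T}\right)$$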

The size and depth of a circuit for $f^{(n)}_{\mathrm{tri\_inv}}$ are $M_{\mathrm{matrix}}(n, K)$ and $O(\log^2 n)$, as shown in
Theorem 6.5.1. The circuits for $f^{(n)}_{\mathrm{scale}}$ and $f^{(n)}_{\mathrm{sub}}$ have size $n^2$ and depth 1; the former multiplies
the elements of the $j$th column of $N$ by the multiplicative inverse of the $j$th diagonal element of
$D$ for $1 \le j \le n$, while the latter subtracts corresponding elements from the two input
matrices.

Let $C_{SPD}(n) = C(f^{(n)}_{LDL^T})$ and $D_{SPD}(n) = D(f^{(n)}_{LDL^T})$. Since $M_{\mathrm{matrix}}(n/2, K) \le (1/4) M_{\mathrm{matrix}}(n, K)$
is assumed (see Assumption 6.3.1), and $2n^2 \le M_{\mathrm{matrix}}(n, K)$ (see
Assumption 6.3.2), the above inequalities become

$$C_{SPD}(n) \le M_{\mathrm{matrix}}(n/2, K) + (n/2)^2 + 2M_{\mathrm{matrix}}(n/2, K) + (n/2)^2 + 2C_{SPD}(n/2)$$
$$\le 2C_{SPD}(n/2) + M_{\mathrm{matrix}}(n, K) \qquad (6.11)$$

$$D_{SPD}(n) \le O(\log^2(n/2)) + 1 + 2O(\log(n/2)) + 1 + 2D_{SPD}(n/2)$$
$$\le 2D_{SPD}(n/2) + K \log^2 n \quad \text{for some } K > 0 \qquad (6.12)$$

As a consequence, we have the following results. 

THEOREM 6.5.3 Let $n$ be a power of two. Then there exists a circuit to compute the $LDL^T$
factorization of an $n \times n$ matrix whose size and depth satisfy

$$C\left(f^{(n)}_{LDL^T}\right) \le 2M_{\mathrm{matrix}}(n, K)$$

$$D\left(f^{(n)}_{LDL^T}\right) \le O(n \log^2 n)$$

Proof From (6.11) we have that

$$C_{SPD}(n) \le \sum_{j=0}^{\log n} 2^j M_{\mathrm{matrix}}(n/2^j, K)$$

By Assumption 6.3.2, $M_{\mathrm{matrix}}(n/2, K) \le (1/4) M_{\mathrm{matrix}}(n, K)$. It follows by induction
that $M_{\mathrm{matrix}}(n/2^j, K) \le (1/4)^j M_{\mathrm{matrix}}(n, K)$, which bounds the above sum by a geometric
series whose sum is at most $2M_{\mathrm{matrix}}(n, K)$. The bound on $D(f^{(n)}_{LDL^T})$ follows
from the observation that $(2c)(n/2) \log^2(n/2) + c \log^2 n \le cn \log^2 n$ for $n \ge 2$ and
$c > 0$. ■

This result combined with earlier observations provides a matrix inversion algorithm for
arbitrary non-singular matrices.

THEOREM 6.5.4 The matrix inverse function $f^{(n)}_{A^{-1}}$ for arbitrary non-singular $n \times n$ matrices
over an arbitrary field $\mathcal{R}$ can be computed by an algebraic circuit whose size and depth satisfy the
following bounds:

$$C\left(f^{(n)}_{A^{-1}}\right) = \Theta(M_{\mathrm{matrix}}(n, K))$$

$$D\left(f^{(n)}_{A^{-1}}\right) = O(n \log^2 n)$$

Proof To invert a non-singular $n \times n$ matrix $M$ that is not SPD, form the product $P = M^T M$
(which is SPD) with one instance of MULT[n] and then invert it. Then multiply
$P^{-1}$ by $M^T$ on the right with a second instance of MULT[n]. To invert $P$, compute
its $LDL^T$ factorization and invert it by forming $(L^T)^{-1} D^{-1} L^{-1}$. Inverting $LDL^T$
requires one application of TRI_INV[n], one application of SCALE[n], and one application of
MULT[n], in addition to the steps used to form the factorization. Thus, three applications
of MULT[n] are used in addition to the factorization steps. The following bounds hold:

$$C\left(f^{(n)}_{A^{-1}}\right) \le 4M_{\mathrm{matrix}}(n, K) + n^2 \le 4.5 M_{\mathrm{matrix}}(n, K)$$

$$D\left(f^{(n)}_{A^{-1}}\right) = O(n \log^2 n) + O(\log n) = O(n \log^2 n)$$

The lower bound on $C(f^{(n)}_{A^{-1}})$ follows from Lemma 6.5.1. ■

6.5.5 Fast Matrix Inversion* 

In this section we present a depth-$O(\log^2 n)$ circuit for the inversion of $n \times n$ matrices known
as Csanky's algorithm, which is based on the method of Leverrier. Since this algorithm uses
a number of well-known matrix functions and properties that space precludes explaining in
detail, advanced knowledge of matrices and polynomials is required for this section.

The determinant of an $n \times n$ matrix $A$, $\det(A)$, is defined below in terms of the set of all
permutations $\pi$ of the integers $\{1, 2, \ldots, n\}$. Here the sign of $\pi$, denoted $\sigma(\pi)$, is the number
of swaps of pairs of integers needed to realize $\pi$ from the identity permutation.

$$\det(A) = \sum_{\pi} (-1)^{\sigma(\pi)} \prod_{i=1}^{n} a_{i,\pi(i)}$$

Here $\prod_{i=1}^{n} a_{i,\pi(i)}$ is the product $a_{1,\pi(1)} \cdots a_{n,\pi(n)}$. The characteristic polynomial of a
matrix $A$, namely, $\phi_A(x)$ in the variable $x$, is the determinant of $xI - A$, where $I$ is the $n \times n$
identity matrix:

$$\phi_A(x) = \det(xI - A) = x^n + c_{n-1} x^{n-1} + c_{n-2} x^{n-2} + \cdots + c_0$$

If $x$ is set to zero, this equation implies that $c_0 = \det(-A)$. Also, it can be shown that
$\phi_A(A) = 0$, a fact known as the Cayley-Hamilton theorem: A matrix satisfies its own
characteristic polynomial. This implies that

$$A \left(A^{n-1} + c_{n-1} A^{n-2} + c_{n-2} A^{n-3} + \cdots + c_1 I\right) = -c_0 I$$

Thus, when $c_0 \ne 0$ the inverse of $A$ can be computed from

$$A^{-1} = -\frac{1}{c_0} \left(A^{n-1} + c_{n-1} A^{n-2} + c_{n-2} A^{n-3} + \cdots + c_1 I\right)$$

Once the characteristic polynomial of $A$ has been computed, its inverse can be computed
by forming the $n - 1$ successive powers of $A$, namely, $A, A^2, A^3, \ldots, A^{n-1}$, multiplying them
by the coefficients of $\phi_A(x)$, and adding the products together. These powers of $A$ can be
computed using a prefix circuit having $O(n)$ instances of the associative matrix multiplication
operator and depth $O(\log n)$ measured in the number of instances of this operator. We have
defined $M_{\mathrm{matrix}}(n, K)$ to be the size of the smallest $n \times n$ matrix multiplication circuit with
depth $K \log n$ (Definition 6.3.1). Thus, the successive powers of $A$ can be computed by a
circuit of size $O(n M_{\mathrm{matrix}}(n, K))$ and depth $O(\log^2 n)$. The size bound can be improved to
$O(\sqrt{n}\, M_{\mathrm{matrix}}(n, K))$. (See Problem 6.15.)

To complete the derivation of the Csanky algorithm we must produce the coefficients of
the characteristic polynomial of $A$. For this we invoke Leverrier's theorem. This theorem uses
the notion of the trace of a matrix $A$, that is, the sum of the elements on its main diagonal,
denoted $\mathrm{tr}(A)$.

THEOREM 6.5.5 (Leverrier) The coefficients of the characteristic polynomial of the $n \times n$ matrix
$A$ satisfy the following identity, where $s_r = \mathrm{tr}(A^r)$ for $1 \le r \le n$:

$$\begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ s_1 & 2 & 0 & \cdots & 0 \\ s_2 & s_1 & 3 & \cdots & 0 \\ \vdots & & & \ddots & \vdots \\ s_{n-1} & \cdots & s_2 & s_1 & n \end{bmatrix} \begin{bmatrix} c_{n-1} \\ c_{n-2} \\ c_{n-3} \\ \vdots \\ c_0 \end{bmatrix} = - \begin{bmatrix} s_1 \\ s_2 \\ s_3 \\ \vdots \\ s_n \end{bmatrix} \qquad (6.13)$$

Proof The degree-$n$ characteristic polynomial $\phi_A(x)$ of $A$ can be factored over a field of
characteristic zero. If $\lambda_1, \lambda_2, \ldots, \lambda_n$ are its roots, we write

$$\phi_A(x) = \prod_{j=1}^{n} (x - \lambda_j)$$

From expanding this expression, it is clear that the coefficient $c_{n-1}$ of $x^{n-1}$ is $-\sum_{j=1}^{n} \lambda_j$.
Similarly, expanding $\det(xI - A)$, $c_{n-1}$ is the negative sum of the diagonal elements of $A$,
that is, $c_{n-1} = -\mathrm{tr}(A)$. It follows that $\mathrm{tr}(A) = \sum_{j=1}^{n} \lambda_j$.

The $\lambda_j$'s are called the eigenvalues of $A$, that is, values such that there exists an $n$-vector
$u$ (an eigenvector) such that $Au = \lambda_j u$. It follows that $A^r u = \lambda_j^r u$. It can be shown
that $\lambda_1^r, \ldots, \lambda_n^r$ are precisely the eigenvalues of $A^r$, so $\prod_{j=1}^{n} (x - \lambda_j^r)$ is the characteristic
polynomial of $A^r$. Since $s_r = \mathrm{tr}(A^r)$, $s_r = \sum_{j=1}^{n} \lambda_j^r$.

Let $s_0 = 1$ and $s_k = 0$ for $k < 0$. Then, to complete the proof of (6.13), we must show
that the following identity holds for $1 \le i \le n$:

$$s_{i-1} c_{n-1} + s_{i-2} c_{n-2} + \cdots + s_1 c_{n-i+1} + i c_{n-i} = -s_i$$

Moving $s_i$ to the left-hand side, substituting for the traces, and using the definition of the
characteristic polynomial yield

$$i c_{n-i} + \sum_{j=1}^{n} \frac{\phi_A(\lambda_j) - \left(\lambda_j^{n-i} c_{n-i} + \lambda_j^{n-i-1} c_{n-i-1} + \cdots + \lambda_j c_1 + c_0\right)}{\lambda_j^{n-i}} = 0$$

Since $\phi_A(\lambda_j) = 0$, when we substitute $l$ for $n - i$ it suffices to show the following for
$0 \le l \le n - 1$:

$$(n - l) c_l = \sum_{j=1}^{n} \sum_{k=0}^{l} \frac{c_k}{\lambda_j^{l-k}} \qquad (6.14)$$

This identity can be shown by induction using as the base case $l = 0$ and the following facts
about the derivatives of the characteristic polynomial of $A$, which are easy to establish:

$$c_0 = (-1)^n \prod_{j=1}^{n} \lambda_j$$

$$c_k = \frac{1}{k!} \left. \frac{d^k \phi_A(x)}{dx^k} \right|_{x=0}$$

The reader is asked to show that (6.14) follows from these identities. (See Problem 6.17.) ■

Csanky's algorithm computes the traces of powers, namely the $s_r$'s, and then inverts the
lower triangular matrix given above, thereby solving for the coefficients of the characteristic
polynomial. The coefficients are then used with a prefix computation, as mentioned earlier, to
compute the inverse. Each of the $n$ $s_r$'s can be computed in $O(n)$ steps once the powers of
$A$ have been formed by the prefix computation described above. The lower triangular matrix
is non-singular and can be inverted by a circuit with $M_{\mathrm{matrix}}(n, K)$ operations and depth
$O(\log^2 n)$, as shown in Theorem 6.5.1. The following theorem summarizes these results.

THEOREM 6.5.6 The matrix inverse function for non-singular $n \times n$ matrices over a field of
characteristic zero, $f^{(n)}_{A^{-1}}$, has an algebraic circuit whose size and depth satisfy the following bounds:

$$C\left(f^{(n)}_{A^{-1}}\right) = O(n M_{\mathrm{matrix}}(n, K))$$

$$D\left(f^{(n)}_{A^{-1}}\right) = O(\log^2 n)$$

The size bound can be improved to $O(\sqrt{n}\, M_{\mathrm{matrix}}(n, K))$, as suggested in Problems 6.15
and 6.16.
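
The arithmetic behind this method can be illustrated with a short Python sketch. It is a sequential illustration only and makes several assumptions that are not in the text: the name `leverrier_inverse` is hypothetical, the powers of $A$ are formed one after another rather than by a prefix circuit, and the triangular system (6.13) is solved by forward substitution instead of by inverting the triangular matrix. Exact `Fraction` arithmetic plays the role of a field of characteristic zero.

```python
from fractions import Fraction

def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def leverrier_inverse(A):
    """Invert A from the traces of its powers and the Cayley-Hamilton theorem."""
    n = len(A)
    identity = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    powers = [identity]                       # powers[r] = A^r for r = 0..n
    for _ in range(n):
        powers.append(mat_mul(powers[-1], A))
    # s[r] = tr(A^r) for 1 <= r <= n (index 0 is unused)
    s = [None] + [sum(powers[r][i][i] for i in range(n)) for r in range(1, n + 1)]
    # c[k] is the coefficient of x^k in the characteristic polynomial; c[n] = 1
    c = [None] * n + [Fraction(1)]
    for i in range(1, n + 1):
        # row i of (6.13): s_{i-1} c_{n-1} + ... + s_1 c_{n-i+1} + i c_{n-i} = -s_i
        acc = s[i] + sum(s[i - k] * c[n - k] for k in range(1, i))
        c[n - i] = -acc / i
    if c[0] == 0:
        raise ValueError("matrix is singular")
    # A^{-1} = -(1/c_0) (A^{n-1} + c_{n-1} A^{n-2} + ... + c_1 I)
    return [[-sum(c[k] * powers[k - 1][i][j] for k in range(1, n + 1)) / c[0]
             for j in range(n)] for i in range(n)]
```

For the matrix with rows (2, 1) and (1, 1), the traces are $s_1 = 3$ and $s_2 = 7$, the characteristic polynomial is $x^2 - 3x + 1$, and the sketch returns $3I - A$.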



6.6 Solving Linear Systems 



A general linear system with $n \times n$ coefficient matrix $M$, $n$-vector $x$ of unknowns and $n$-vector
$b$ is defined in (6.5) and repeated below:

$$Mx = b$$

This system can be solved for $x$ in terms of $M$ and $b$ using the following steps when $M$ is not
SPD. If it is SPD, the first step is unnecessary and can be eliminated.

a) Premultiply both sides by the transpose of $M$ to produce the following linear system in
which the coefficient matrix $M^T M$ is SPD:

$$M^T M x = M^T b = b^*$$

b) Compute the $LDL^T$ decomposition of $M^T M$.

c) Solve the system (6.15) by solving three subsequent systems:

$$LDL^T x = b^* \qquad (6.15)$$

$$Lu = b^* \qquad (6.16)$$

$$Dv = u \qquad (6.17)$$

$$L^T x = v \qquad (6.18)$$

Clearly, $Lu = LDv = LDL^T x = b^*$.

The vector $b^*$ is formed by a matrix-vector multiplication that can be done with $n^2$
multiplications and $n(n - 1)$ additions, for a total of $2n^2 - n$ operations.

Since $L$ is unit lower triangular, the system (6.16) is solved by forward elimination. The
value of $u_1$ is $b^*_1$. The value of $u_2$ is $b^*_2 - l_{2,1} u_1$, obtained by eliminating $u_1$ from the second
equation. Similarly, on the $j$th step, the values of $u_1, u_2, \ldots, u_{j-1}$ are known and their
weighted values can be subtracted from $b^*_j$ to provide the value of $u_j$; that is,

$$u_j = b^*_j - l_{j,1} u_1 - l_{j,2} u_2 - \cdots - l_{j,j-1} u_{j-1}$$

for $1 \le j \le n$. Here $n(n - 1)/2$ products are formed and $n(n - 1)/2$ subtractions taken for
a total of $n(n - 1)$ operations.

Since $D$ is diagonal, the system (6.17) is solved for $v$ by multiplying $u_j$ by the multiplicative
inverse of $d_{j,j}$; that is,

$$v_j = u_j d_{j,j}^{-1}$$

for $1 \le j \le n$. This is called normalization. Here $n$ divisions are performed.

Finally, the system (6.18) is solved for $x$ by backward substitution, which is forward
elimination applied to the elements of $x$ in reverse order.
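
The three solves (6.16) through (6.18) can be sketched in Python as follows. This is a minimal illustration that assumes the factors are already available: `L` is a unit lower triangular matrix given as a list of rows, `D` is the list of diagonal entries of the diagonal matrix, and the function name `solve_ldlt` is hypothetical.

```python
from fractions import Fraction

def solve_ldlt(L, D, b):
    """Solve L D L^T x = b by forward elimination, normalization,
    and backward substitution."""
    n = len(b)
    # Forward elimination for L u = b: u_j = b_j - sum_{k<j} l_{j,k} u_k
    u = [0] * n
    for j in range(n):
        u[j] = b[j] - sum(L[j][k] * u[k] for k in range(j))
    # Normalization for D v = u: v_j = u_j / d_{j,j}
    v = [Fraction(u[j]) / D[j] for j in range(n)]
    # Backward substitution for L^T x = v, processing x in reverse order
    x = [Fraction(0)] * n
    for j in reversed(range(n)):
        x[j] = v[j] - sum(L[k][j] * x[k] for k in range(j + 1, n))
    return x
```

With $L$ having rows (1, 0) and (2, 1) and $D$ having diagonal (2, 3), the product $LDL^T$ has rows (2, 4) and (4, 11), and the right-hand side (10, 26) yields the solution (1, 2).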

THEOREM 6.6.1 Let $f^{(n)}_{\mathrm{SPD\_solve}} : R^{n^2+n} \mapsto R^n$ be the (partial) function that computes the
solution to a linear system of equations defined by an $n \times n$ symmetric positive definite coefficient
matrix $M$. Then

$$C\left(f^{(n)}_{\mathrm{SPD\_solve}}\right) \le C\left(f^{(n)}_{LDL^T}\right) + O(n^2)$$

$$D\left(f^{(n)}_{\mathrm{SPD\_solve}}\right) \le D\left(f^{(n)}_{LDL^T}\right) + O(n)$$

If $M$ is not SPD but is non-singular, an additional $O(M_{\mathrm{matrix}}(n, K))$ circuit elements and
depth $O(\log n)$ suffice to compute it.



6.7 Convolution and the FFT Algorithm 



The discrete Fourier transform (DFT) and convolution are widely used techniques with im- 
portant applications in signal processing and computer science. 

In this section we introduce the DFT, describe the fast Fourier transform algorithm, and
derive the convolution theorem. The naive DFT algorithm on sequences of length $n$ uses
$O(n^2)$ operations; the fast Fourier transform algorithm uses only $O(n \log n)$ operations, a
saving of a factor of at least 100 for $n \ge 1{,}000$. The convolution theorem provides a way
to use the DFT to convolve two sequences in $O(n \log n)$ steps, many fewer than the naive
algorithm for convolution, which uses $O(n^2)$ steps.

6.7.1 Commutative Rings*

Since the DFT is defined over commutative rings having an $n$th root of unity, we digress briefly
to discuss such rings. (Commutative rings are defined in Section 6.2.)

DEFINITION 6.7.1 A commutative ring $\mathcal{R} = (R, +, *, 0, 1)$ has a principal $n$th root of unity
$\omega$ if $\omega \in R$ satisfies the following conditions:

$$\omega^n = 1 \qquad (6.19)$$

$$\sum_{k=0}^{n-1} \omega^{lk} = 0 \quad \text{for each } 1 \le l \le n - 1 \qquad (6.20)$$

The elements $\omega^0, \omega^1, \omega^2, \ldots, \omega^{n-1}$ are the $n$th roots of unity and the elements
$\omega^0, \omega^{-1}, \omega^{-2}, \ldots, \omega^{-(n-1)}$ are the $n$th inverse roots of unity. (Note that $\omega^{-j} = \omega^{n-j}$ is the
multiplicative inverse of $\omega^j$ since $\omega^j \omega^{n-j} = \omega^n = 1$.)

Two commutative rings that have principal $n$th roots of unity are the complex numbers
and the ring $\mathbb{Z}_m$ of integers modulo $m = 2^{tn/2} + 1$ when $t \ge 2$ and $n = 2^q$, as we show.
The reader is asked to show that $\mathbb{Z}_m$ has a principal $n$th root of unity, as stated below. (See
Problem 6.24.)

LEMMA 6.7.1 Let $\mathbb{Z}_m$ be the ring of integers modulo $m$ when $m = 2^{tn/2} + 1$, $t \ge 2$ and
$n = 2^q$. Then $\omega = 2^t$ is a principal $n$th root of unity.

An example of the ring $\mathbb{Z}_m$ is given by $t = 2$, $n = 4$, and $m = 2^4 + 1 = 17$. In this
ring $\omega = 4$ is a principal fourth root of unity. This is true because
$\omega^n = 4^4 = 16 \cdot 16 = (16 + 1)(16 - 1) + 1 = 1 \bmod (16 + 1)$ and, for $1 \le p \le n - 1$,
$\sum_{j=0}^{n-1} \omega^{pj} = ((4^p)^n - 1)/(4^p - 1) \bmod 17 = ((4^n)^p - 1)/(4^p - 1) \bmod 17 = (1^p - 1)/(4^p - 1) \bmod 17 = 0 \bmod 17$.
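
Conditions (6.19) and (6.20) are easy to check numerically in $\mathbb{Z}_m$. The following Python sketch does so by brute force; the function name `is_principal_root` is an assumption made for the example, and it relies only on the built-in three-argument `pow` for modular exponentiation.

```python
def is_principal_root(omega, n, m):
    """Check conditions (6.19) and (6.20) for omega in the integers modulo m."""
    if pow(omega, n, m) != 1:                 # (6.19): omega^n = 1
        return False
    # (6.20): sum_{k=0}^{n-1} omega^(l*k) = 0 for each 1 <= l <= n-1
    return all(sum(pow(omega, l * k, m) for k in range(n)) % m == 0
               for l in range(1, n))
```

In $\mathbb{Z}_{17}$ the element 4 passes both checks for $n = 4$, whereas 16 satisfies $16^4 = 1$ but fails (6.20) for $l = 2$, so it is a fourth root of unity that is not principal.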

LEMMA 6.7.2 $e^{2\pi i/n} = \cos(2\pi/n) + i \sin(2\pi/n)$ is a principal $n$th root of unity over the
complex numbers where $i = \sqrt{-1}$ is the "imaginary unit."

Proof The first condition is satisfied because $(e^{2\pi i/n})^n = e^{2\pi i} = 1$. Also,
$\sum_{k=0}^{n-1} \omega^{lk} = (\omega^{ln} - 1)/(\omega^l - 1) = 0$ if $1 \le l \le n - 1$ for $\omega = e^{2\pi i/n}$. ■

6.7.2 The Discrete Fourier Transform 

The discrete Fourier transform has many applications. In Section 6.7.4 we see that it can be 
used to compute the convolution of two sequences efficiently, which is the same as computing 
the coefficients of the product of two polynomials. The discrete Fourier transform can also be 
used to construct a fast algorithm (circuit) for the multiplication of two binary integers [303]. 
It is widely used in processing analog data such as speech and music. 

The $n$-point discrete Fourier transform $F_n : R^n \mapsto R^n$ maps $n$-tuples $a = (a_0, a_1, \ldots, a_{n-1})$
over $R$ to $n$-tuples $f = (f_0, f_1, \ldots, f_{n-1})$ over $R$; that is, $F_n(a) = f$. The components
of $f$ are defined as the values of the following polynomial $p(x)$ at the $n$th roots of
unity:

$$p(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_{n-1} x^{n-1} \qquad (6.21)$$

Then $f_r$, the $r$th component of $F_n(a)$, is defined as

$$f_r = p(\omega^r) = \sum_{k=0}^{n-1} a_k \omega^{rk} \qquad (6.22)$$

This computation is equivalent to the following matrix-vector multiplication:

$$F_n(a) = [\omega^{ij}] \times a \qquad (6.23)$$

where $[\omega^{ij}]$ is the $n \times n$ Vandermonde matrix whose $i,j$ entry is $\omega^{ij}$, $0 \le i, j \le n - 1$, and
$a$ is treated as a column vector.

The $n$-point inverse discrete Fourier transform $F_n^{-1} : R^n \mapsto R^n$ is defined as the values
of the following polynomial $q(x)$ at the inverse $n$th roots of unity:

$$q(x) = (f_0 + f_1 x + f_2 x^2 + \cdots + f_{n-1} x^{n-1})/n \qquad (6.24)$$

That is, the inverse DFT maps an $n$-tuple $f$ to an $n$-tuple $g$, namely, $F_n^{-1}(f) = g$, where $g_s$
is defined as follows:

$$g_s = q(\omega^{-s}) = \frac{1}{n} \sum_{l=0}^{n-1} f_l \omega^{-ls} \qquad (6.25)$$

This computation is equivalent to the following matrix-vector multiplication:

$$F_n^{-1}(f) = \left[\frac{\omega^{-ij}}{n}\right] \times f$$

Because of the following lemma it is legitimate to call $F_n^{-1}$ the inverse of $F_n$.

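LEMMA 6.7.3 For all $a \in R^n$, $a = F_n^{-1}(F_n(a))$.

Proof Let $f = F_n(a)$ and $g = F_n^{-1}(f)$. Then $g_s$ satisfies the following:

$$g_s = \frac{1}{n} \sum_{l=0}^{n-1} \omega^{-ls} f_l = \frac{1}{n} \sum_{l=0}^{n-1} \omega^{-ls} \sum_{k=0}^{n-1} \omega^{lk} a_k$$
$$= \sum_{k=0}^{n-1} a_k \frac{1}{n} \sum_{l=0}^{n-1} \omega^{(k-s)l} = a_s$$

The second line results from a change in the order of summation. The last equality follows
from the definition of $n$th roots of unity: by (6.20) the inner sum is $n$ when $k = s$ and 0
otherwise. It follows that the matrix $[\omega^{-ij}/n]$ is the inverse of $[\omega^{ij}]$. ■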
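
Definitions (6.22) and (6.25) translate directly into a naive quadratic-time computation. The Python sketch below works over the ring of integers modulo $m$; the function names `dft` and `inverse_dft` are assumptions for the example, `omega` is assumed to be a principal $n$th root of unity modulo `m`, and $n$ is assumed to be invertible modulo `m`. The call `pow(n, -1, m)` for a modular inverse requires Python 3.8 or later.

```python
def dft(a, omega, m):
    """Naive n-point DFT over Z_m: f_r = sum_k a_k * omega^(r*k), as in (6.22)."""
    n = len(a)
    return [sum(a[k] * pow(omega, r * k, m) for k in range(n)) % m
            for r in range(n)]

def inverse_dft(f, omega, m):
    """Naive inverse DFT over Z_m: g_s = (1/n) sum_l f_l * omega^(-l*s), as in (6.25)."""
    n = len(f)
    omega_inv = pow(omega, n - 1, m)   # omega^(-1) = omega^(n-1) since omega^n = 1
    n_inv = pow(n, -1, m)              # multiplicative inverse of n modulo m
    return [n_inv * sum(f[l] * pow(omega_inv, l * s, m) for l in range(n)) % m
            for s in range(n)]
```

With $m = 17$ and $\omega = 4$, applying `inverse_dft` to the output of `dft` recovers the original 4-tuple, as Lemma 6.7.3 states.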

The computation of the $n$-point DFT and its inverse using the naive algorithms suggested
by their definitions requires $O(n^2)$ steps. Below we show that a fast DFT algorithm exists for
which only $O(n \log n)$ steps suffice.

6.7.3 Fast Fourier Transform



The fast Fourier transform algorithm is a consequence of the following observation: when
$n$ is even, the polynomial $p(x)$ in equation (6.21) can be decomposed as

$$\begin{aligned} p(x) &= a_0 + a_1 x + a_2 x^2 + \cdots + a_{n-1} x^{n-1} \\ &= (a_0 + a_2 x^2 + \cdots + a_{n-2} x^{n-2}) + x (a_1 + a_3 x^2 + \cdots + a_{n-1} x^{n-2}) \\ &= p_e(x^2) + x\, p_o(x^2) \end{aligned} \qquad (6.26)$$

Here $p_e(y)$ and $p_o(y)$ are polynomials of degree $(n/2) - 1$.

Let $n$ be a power of 2, that is, $n = 2^d$. As stated above, the $n$-point DFT of $a$ is
obtained by evaluating $p(x)$ at the $n$th roots of unity. Because of the decomposition of
$p(x)$, it suffices to evaluate $p_e(y)$ and $p_o(y)$ at
$y = (\omega^0)^2, (\omega^1)^2, (\omega^2)^2, \ldots, (\omega^{n-1})^2 = (\omega^2)^0, (\omega^2)^1, (\omega^2)^2, \ldots, (\omega^2)^{n-1}$
and combine their values with one multiplication and one
addition for each of the $n$ roots of unity. However, because $\omega^2$ is a $(n/2)$th principal root
of unity (see Problem 6.25), $(\omega^2)^{(n/2)+r} = (\omega^2)^r$ and the $n$ powers of $\omega^2$ collapse to $n/2$
distinct powers of $\omega^2$, namely, the $(n/2)$th roots of unity. Thus, $p(x)$ at the $n$th roots of unity
can be evaluated by evaluating $p_e(y)$ and $p_o(y)$ at the $(n/2)$th roots of unity and combining
their values with one addition and multiplication for each of the $n$th roots of unity. In other
words, the $n$-point DFT of $a$ can be done by performing the $(n/2)$-point DFT of its even
and odd subsequences and combining the results with $O(n)$ additional steps. This is the fast
Fourier transform (FFT) algorithm.
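
The even/odd decomposition can be sketched as a short recursive Python function over the integers modulo $m$. This is an illustrative sketch, not the straight-line program itself: the name `fft` is an assumption, the length of `a` is assumed to be a power of 2, and `omega` is assumed to be a principal root of unity of that order modulo `m`.

```python
def fft(a, omega, m):
    """Recursive FFT over Z_m using p(x) = p_e(x^2) + x * p_o(x^2)."""
    n = len(a)
    if n == 1:
        return [a[0] % m]                     # a constant polynomial
    omega_sq = omega * omega % m              # principal (n/2)th root of unity
    even = fft(a[0::2], omega_sq, m)          # values of p_e at powers of omega^2
    odd = fft(a[1::2], omega_sq, m)           # values of p_o at powers of omega^2
    half = n // 2
    f = [0] * n
    w = 1                                     # w = omega^r
    for r in range(n):
        # the n powers of omega^2 collapse to n/2 distinct values
        f[r] = (even[r % half] + w * odd[r % half]) % m
        w = w * omega % m
    return f
```

Each level of the recursion performs one multiplication and one addition per output, which is the source of the $O(n \log n)$ operation count.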

We denote by $F^{(d)}$ the directed acyclic graph associated with the straight-line program
resulting from this realization of the FFT on $n = 2^d$ inputs. A circuit for the FFT
algorithm on 16 inputs, $F^{(4)}$, is shown in Fig. 6.7. It is computed from the eight-point FFT on
the even and odd components of $a$, as shown in the boxed regions. These components are
permuted because each of these smaller FFTs is computed recursively in turn. (The index of
the $i$th input vertex from the left is obtained by writing the integer $i$ as a binary number,
reversing the bits, and converting the resulting binary number to an integer. This is called the
bit-reverse permutation of the binary representation of the integer. For example, the third
input from the left has index 3, which is (0011) in binary. Reversed, the binary number is (1100),
which represents 12.) Inputs are associated with the open vertices at the bottom of the graph.
Each vertex except for input vertices is associated with an addition and a multiplication. For
example, the white vertex at the top of the graph computes $f_8 = p_e((\omega^8)^2) + \omega^8 p_o((\omega^8)^2)$,
where $(\omega^8)^2 = \omega^{16} = \omega^0$.

[Figure 6.7 shows the outputs $f_0, f_1, \ldots, f_{15}$ at the top, boxed eight-point FFTs for $p_e(x)$ and
$p_o(x)$, and the inputs at the bottom in the order $a_0, a_8, a_4, a_{12}, a_2, a_{10}, a_6, a_{14}, a_1, a_9, a_5, a_{13},
a_3, a_{11}, a_7, a_{15}$.]

Figure 6.7 A circuit $F^{(4)}$ for the FFT algorithm on 16 inputs.
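
The bit-reverse permutation described above is simple to compute. The following Python sketch uses a hypothetical helper `bit_reverse(i, d)` that reverses the `d`-bit binary representation of `i`, with positions counted from 0.

```python
def bit_reverse(i, d):
    """Reverse the d-bit binary representation of the integer i."""
    r = 0
    for _ in range(d):
        r = (r << 1) | (i & 1)   # move the lowest bit of i onto the end of r
        i >>= 1
    return r

# Order of the input indices along the bottom of the 16-point FFT graph
input_order = [bit_reverse(i, 4) for i in range(16)]
```

For $d = 4$ this maps 3, which is 0011 in binary, to 1100, which is 12.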

Let $C(F^{(d)})$ and $D(F^{(d)})$ be the size and depth of circuits for the $2^d$-point FFT algorithm
for integer $d \ge 1$. The construction given above leads to the following recurrences for these
two measures:

$$C\left(F^{(d)}\right) \le 2C\left(F^{(d-1)}\right) + 2^{d+1}$$

$$D\left(F^{(d)}\right) \le D\left(F^{(d-1)}\right) + 2$$

Also, examination of the base case of $n = 2$ demonstrates that $C(F^{(1)}) = 3$ and $D(F^{(1)}) = 2$,
from which we have the following theorem.

THEOREM 6.7.1 Let $n = 2^d$. The circuit for the $n$-point FFT algorithm over a commutative
ring $\mathcal{R}$ has the following circuit size and depth bounds:

$$C\left(F^{(d)}\right) \le 2n \log n$$

$$D\left(F^{(d)}\right) \le 2 \log n$$


The FFT graph is used in later chapters to illustrate tradeoffs between space and time, space 
and the number of I/O operations, and area and time for computation with VLSI machines. 
For each of these applications we decompose the FFT graph into sub-FFT graphs. One such 
decomposition is shown in Fig. 6.7. A more general decomposition is shown in Fig. 6.8 and 
described below. 

LEMMA 6.7.4 The $2^d$-point FFT graph $F^{(d)}$ can be decomposed into $2^e$ $2^{d-e}$-point bottom
FFT graphs, $\{F^{(d-e)}_{b,j} \mid 1 \le j \le 2^e\}$, and $2^{d-e}$ $2^e$-point top FFT graphs, $\{F^{(e)}_{t,j} \mid 1 \le j \le 2^{d-e}\}$.
The $i$th input of $F^{(e)}_{t,j}$ is the $j$th output of $F^{(d-e)}_{b,i}$.

In Fig. 6.8 the vertices and edges have been grouped together as recognizable FFT graphs 
and surrounded by shaded boxes. The edges between boxes are not edges of the FFT graph but 
instead are used to identify vertices that are simultaneously outputs of bottom FFT subgraphs 
and inputs to top FFT subgraphs. 

COROLLARY 6.7.1 $F^{(d)}$ can be decomposed into $\lfloor d/e \rfloor$ stages each containing $2^{d-e}$ copies of
$F^{(e)}$ and one stage containing $2^{d-k}$ copies of $F^{(k)}$, $k = d - \lfloor d/e \rfloor e$. ($F^{(0)}$ is a single vertex.)
The output vertices of one stage are the input vertices to the next.

Proof From Lemma 6.7.4, each of the $2^e$ bottom FFT subgraphs $F^{(d-e)}_b$ can be further
decomposed into $2^{d-2e}$ top FFT subgraphs $F^{(e)}_t$ and $2^e$ bottom FFT subgraphs $F^{(d-2e)}_b$.
By repeating this process $t$ times, $t \le d/e$, $F^{(d)}$ can be decomposed into $t$ stages each
containing $2^{d-e}$ copies of $F^{(e)}$ and one stage containing $2^{te}$ copies of $F^{(d-te)}$. The
result follows by setting $t = \lfloor d/e \rfloor$. ■

Figure 6.8 Decomposition of the 32-point FFT graph $F^{(5)}$ into four copies of $F^{(3)}$ and 8
copies of $F^{(2)}$. The edges between bottom and top sub-FFT graphs do not exist in the FFT
graph. They are used here to identify common vertices and highlight the communication needed
among sub-FFT graphs.

6.7.4 Convolution Theorem 

The convolution function $f^{(n,m)}_{\mathrm{conv}} : R^{n+m} \mapsto R^{n+m-1}$ over a commutative ring $\mathcal{R}$ maps an
$n$-tuple $a = (a_0, a_1, \ldots, a_{n-1})$ and an $m$-tuple $b = (b_0, b_1, \ldots, b_{m-1})$ onto an $(n + m - 1)$-tuple
$c$, denoted $c = a \otimes b$, where $c_j$ is defined as follows:

$$c_j = \sum_{r+s=j} a_r * b_s \quad \text{for } 0 \le j \le n + m - 2$$

Here $\sum$ and $*$ are addition and multiplication over the ring $\mathcal{R}$. The direct computation of the
convolution function using the above formula takes $O(nm)$ steps. The convolution theorem
given below and the fast Fourier transform algorithm described above allow the convolution
function to be computed in $O(n \log n)$ steps when $n = m$.

Associate with $a$ and $b$ the following polynomials in the variable $x$:

$$a(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_{n-1} x^{n-1}$$

$$b(x) = b_0 + b_1 x + b_2 x^2 + \cdots + b_{m-1} x^{m-1}$$

Then the coefficient of the term $x^j$ in the product polynomial $c(x) = a(x) b(x)$ is clearly the
term $c_j$ in the convolution $c = a \otimes b$.

Convolution is used in signal processing and integer multiplication. In signal processing,
convolution describes the results of passing a signal through a linear filter. In binary integer
multiplication the polynomials $a(2)$ and $b(2)$ represent binary numbers; convolution is related
to the computation of their product.

The convolution theorem is one of the most important applications of the DFT. It
demonstrates that convolution, which appears to require $O(n^2)$ operations when $n = m$,
can in fact be computed by a circuit with $O(n)$ operations plus a small multiple of the number
needed to compute the DFT and its inverse.

THEOREM 6.7.2 Let $\mathcal{R} = (R, +, *, 0, 1)$ be a commutative ring and let $a, b \in R^n$. Let
$F_{2n} : R^{2n} \mapsto R^{2n}$ and $F_{2n}^{-1} : R^{2n} \mapsto R^{2n}$ be the $2n$-point DFT and its inverse over $R$. Let
$F_{2n}(a) \times F_{2n}(b)$ denote the $2n$-tuple obtained from the term-by-term product of the components
of $F_{2n}(a)$ and $F_{2n}(b)$. Then, the convolution $a \otimes b$ satisfies the following identity:

$$a \otimes b = F_{2n}^{-1}(F_{2n}(a) \times F_{2n}(b))$$

Proof The $n$-point DFT $F_n : R^n \mapsto R^n$ transforms the $n$-tuple of coefficients $a$ of the
polynomial $p(x)$ of degree $n - 1$ into the $n$-tuple $f = F_n(a)$. In fact, the $r$th component
of $f$, $f_r$, is the value of the polynomial $p(x)$ at the $r$th of the $n$ roots of unity, namely
$f_r = p(\omega^r)$. The $n$-point inverse DFT $F_n^{-1} : R^n \mapsto R^n$ inverts the process through a
similar computation. If $q(x)$ is the polynomial of degree $n - 1$ whose $l$th coefficient is $f_l/n$,
then the $s$th component of the inverse DFT on $f$, namely $F_n^{-1}(f)$, is $a_s = q(\omega^{-s})$.

As stated above, to compute the convolution of $n$-tuples $a$ and $b$ it suffices to compute
the coefficients of the product polynomial $c(x) = a(x) b(x)$. Since the product $c(x)$ is of
degree $2n - 2$, we can treat it as a polynomial of degree $2n - 1$ and take the $2n$-point
DFT, $F_{2n}$, of it and its inverse, $F_{2n}^{-1}$, of the result. This seemingly futile process leads to an
efficient algorithm for convolution. Since the DFT is obtained by evaluating a polynomial
at the $n$ roots of unity, the DFT of $c(x)$ can be done at the $2n$ roots of unity by evaluating
$a(x)$ and $b(x)$ at the $2n$ roots of unity (that is, computing the DFTs of their coefficients
as if they had degree $2n - 1$), multiplying their values together, and taking the $2n$-point
inverse DFT, that is, performing the computation stated in the theorem. ■

Figure 6.9 The DAG associated with the straight-line program resulting from the application
of the FFT to the convolution theorem with sequences of length 8.
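
The identity in Theorem 6.7.2 can be checked on a small instance with the following Python sketch over the integers modulo $m$. It is illustrative only: the function names are assumptions, the naive quadratic DFT is used in place of the FFT to keep the example short, `omega` is assumed to be a principal $2n$th root of unity modulo `m`, and the true convolution values are assumed to be smaller than `m` so that no information is lost to the modular reduction.

```python
def dft(a, omega, m):
    """Naive DFT over Z_m: f_r = sum_k a_k * omega^(r*k)."""
    n = len(a)
    return [sum(a[k] * pow(omega, r * k, m) for k in range(n)) % m
            for r in range(n)]

def inverse_dft(f, omega, m):
    """Naive inverse DFT over Z_m (pow with exponent -1 needs Python 3.8+)."""
    n = len(f)
    omega_inv = pow(omega, n - 1, m)
    n_inv = pow(n, -1, m)
    return [n_inv * sum(f[l] * pow(omega_inv, l * s, m) for l in range(n)) % m
            for s in range(n)]

def direct_convolution(a, b):
    """c_j = sum over r + s = j of a_r * b_s."""
    c = [0] * (len(a) + len(b) - 1)
    for r, ar in enumerate(a):
        for s, bs in enumerate(b):
            c[r + s] += ar * bs
    return c

def dft_convolution(a, b, omega, m):
    """Convolve two n-tuples via 2n-point DFTs, as in the convolution theorem."""
    n = len(a)
    fa = dft(a + [0] * n, omega, m)           # treat a(x) as having degree 2n - 1
    fb = dft(b + [0] * n, omega, m)
    product = [x * y % m for x, y in zip(fa, fb)]   # term-by-term product
    return inverse_dft(product, omega, m)[:2 * n - 1]
```

For $n = 4$ one can take $m = 257$ and $\omega = 4$, a principal eighth root of unity by Lemma 6.7.1 with $t = 2$.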

The combination of the convolution theorem and the algorithm for the FFT provides a 
fast straight-line program for convolution, as stated below. The directed acyclic graph for this 
straight-line program is shown in Fig. 6.9 on page 269. 

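THEOREM 6.7.3 Let $n = 2^d$. The convolution function $f^{(n,n)}_{\mathrm{conv}} : R^{2n} \mapsto R^{2n-1}$ over a
commutative ring $\mathcal{R}$ can be computed by a straight-line program over $\mathcal{R}$ with size and depth
satisfying the following bounds:

$$C\left(f^{(n,n)}_{\mathrm{conv}}\right) \le 12n \log 2n$$

$$D\left(f^{(n,n)}_{\mathrm{conv}}\right) \le 4 \log 2n$$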



6.8 Merging and Sorting Networks 



The sorting problem is to put into ascending or descending order a collection of items that 
are drawn from a totally ordered set. A set is totally ordered if for every two distinct elements 
of the set one is larger than the other. The merging problem is to merge two sorted lists into 
one sorted list. Sorting and merging algorithms can be either straight-line or non-straight-line. 
An example of a non-straight-line merging algorithm is the following: 

Create a new sorted list from two sorted lists by removing the smaller item from the 
two lists and appending it to the new list until one list is empty, at which point append 
the non-empty list to the end of the new list. 
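
The merging procedure just described can be written as a short Python function. This is a minimal sketch with an assumed name, `merge`; which comparison is made next depends on the data, which is what makes the algorithm non-straight-line.

```python
def merge(xs, ys):
    """Merge two lists sorted in ascending order into one sorted list."""
    out = []
    i = j = 0
    while i < len(xs) and j < len(ys):
        # remove the smaller front item and append it to the new list
        if xs[i] <= ys[j]:
            out.append(xs[i])
            i += 1
        else:
            out.append(ys[j])
            j += 1
    # one list is empty; append what remains of the other
    out.extend(xs[i:])
    out.extend(ys[j:])
    return out
```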

The binary sorting function $f^{(n)}_{\mathrm{sort}} : B^n \mapsto B^n$ described in Section 2.11 sorts a Boolean
$n$-tuple into descending order. The combinational circuit given there is an example of a
straight-line sorting network, a network realized by a straight-line program. When the set of elements
to be sorted is not Boolean, sorting networks can become quite a bit more complicated, as we
see below.

In this section we describe sorting networks, circuits constructed from comparator operators
that take $n$ elements drawn from a finite totally ordered set $A$ and put them into sorted
order. A comparator function $\otimes : A^2 \mapsto A^2$ with arguments $a$ and $b$ returns their maximum
and minimum; that is, $\otimes(a, b) = (\max(a, b), \min(a, b))$.

It is convenient to show a comparator operator as a vertical edge between two lines carrying 
values, as in Fig. 6.10(a). The values on the two lines to the right of the edge are the values to 
its left in sorted order, the smaller being on the upper line. A sorting network is an example 
of a comparator network, a circuit in which the only operator is a comparator. Input values 
appear on the left and output values appear on the right in sorted order. 

Shown in Fig. 6.10(b) is an insertion-sorting network on five inputs that inserts an element
into a previously sorted sublist. Two inputs are sorted at the wavefront labeled A. Between
wavefronts A and B a new item is inserted that is compared against the previously sorted sublist
and inserted into its proper position. The same occurs between wavefronts B and C and after
wavefront C.

Figure 6.10 (a) A comparison operator, and (b) an insertion-sorting network.

An insertion-sorting network can be realized with one comparator for the first
two inputs and $k - 1$ more for the $k$th input, $3 \le k \le n$. Let $C_{\mathrm{insert}}(n)$ and $D_{\mathrm{insert}}(n)$
denote the size and depth of an insertion-sorting network on n elements. Then $C_{\mathrm{insert}}(2) = 1$ and
$D_{\mathrm{insert}}(2) = 1$, and

$$C_{\mathrm{insert}}(n) \le C_{\mathrm{insert}}(n-1) + n - 1 = n(n-1)/2$$

$$D_{\mathrm{insert}}(n) \le \max(D_{\mathrm{insert}}(n-1) + 1,\; n - 1) = n - 1$$

The depth bound follows because there is a path of length $n - 1$ through the chain of comparators
added at the last wavefront and every path through the sorting network is extended by one
comparator with the addition of the new wavefront. A simple proof by induction establishes 
these results. 
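
The size count above can be checked on a small instance. The following is a minimal sketch, not taken from the text: the function names `insertion_network` and `apply_network` are hypothetical, lines are numbered from 0, and each comparator is assumed to leave the smaller value on the lower-numbered (upper) line.

```python
from itertools import permutations

def insertion_network(n):
    """Comparators (i, i + 1) of an insertion-sorting network on n lines.

    Assumption: lines are numbered 0..n-1 from the top and a comparator
    leaves the smaller value on the upper (lower-numbered) line.
    """
    comparators = []
    for k in range(1, n):              # insert the value on line k
        for i in range(k, 0, -1):      # chain of k comparators moving upward
            comparators.append((i - 1, i))
    return comparators

def apply_network(comparators, values):
    """Run the comparators in order on a copy of the input values."""
    out = list(values)
    for i, j in comparators:
        if out[i] > out[j]:
            out[i], out[j] = out[j], out[i]
    return out

if __name__ == "__main__":
    n = 5
    net = insertion_network(n)
    print(len(net), n * (n - 1) // 2)  # both counts are n(n-1)/2
    print(all(apply_network(net, p) == sorted(p)
              for p in permutations(range(n))))
```

The sketch counts comparators only; it does not attempt to measure depth.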



6.8.1 Sorting Via Bitonic Merging

We now describe Batcher's bitonic merging network BM(m), which is the basis for a sorting
network. Let $x = (x_1, x_2, \ldots, x_m)$ and $y = (y_1, y_2, \ldots, y_m)$ be ordered sequences of
length m. That is, $x_j \le x_{j+1}$ and $y_j \le y_{j+1}$. As suggested in Fig. 6.11, the even-indexed
components of x are merged with the odd-indexed components of y, as are the odd-indexed
components of x and the even-indexed components of y. Each of the four lists that are merged
is itself sorted. The two merged lists are interleaved and the kth and (k+1)st elements, k even,
are compared and swapped if necessary. To prove correctness of this circuit, we use the zero-one
principle, which is stated below for sorting networks but applied later to merging networks.

THEOREM 6.8.1 (Zero-one principle) If a comparator network for inputs over a set A correctly
sorts all binary inputs, it correctly sorts all inputs.

Proof The proof is by contradiction. Suppose the network correctly sorts all 0-1 sequences
but fails to sort the input sequence $(a_1, a_2, \ldots, a_n)$. Then there are inputs $a_i$ and $a_j$ such
that $a_i < a_j$ but the network puts $a_j$ before $a_i$.

Since a sorting network contains only comparators, if we replace each entry $a_r$ in an
input sequence $(a_1, a_2, \ldots, a_n)$ with a new entry $h(a_r)$, where $h(a)$ is monotonically
non-decreasing in a ($h(a)$ is non-decreasing as a increases), each comparison of entries
$a_r$ and $a_s$ is replaced by a comparison of entries $h(a_r)$ and $h(a_s)$. Since $a_r \le a_s$ only
if $h(a_r) \le h(a_s)$, the set of comparisons made by the sorting network will be exactly
the same on $(a_1, a_2, \ldots, a_n)$ as on $(h(a_1), h(a_2), \ldots, h(a_n))$. Thus, the original output
$(b_1, b_2, \ldots, b_n)$ will be replaced by the output sequence $(h(b_1), h(b_2), \ldots, h(b_n))$.

Figure 6.11 A recursive construction of the bitonic merging network BM(4). The even-indexed
elements of one sorted sequence are merged with the odd-indexed elements of the other,
the resulting sequences interleaved, and the even- and succeeding odd-indexed elements
compared. The inputs of one sequence are permuted to demonstrate that BM(4) uses two copies of
BM(2).

Since it is presumed that the comparator network puts $a_i$ and $a_j$ in the incorrect order,
let $h(x)$ be the following monotone function:

$$h(x) = \begin{cases} 0 & \text{if } x \le a_i \\ 1 & \text{if } x > a_i \end{cases}$$

Then the input and output sequences to the comparator network are binary. However,
the output sequence is not sorted ($a_j$ appears before $a_i$ but $h(a_j) = 1$ and $h(a_i) = 0$),
contradicting the hypothesis of the theorem. It follows that all sequences over A must be
sorted correctly. ■
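
The zero-one principle turns verification of a comparator network into a finite check over $2^n$ binary inputs. The sketch below illustrates this under stated assumptions: the helper names and the two four-input example networks are hypothetical and are not taken from the text. It also checks the key step of the proof, that a monotone threshold function h commutes with the network.

```python
from itertools import product, permutations

def apply_network(comparators, values):
    """Apply comparators (i, j), i < j, leaving the smaller value on line i."""
    out = list(values)
    for i, j in comparators:
        if out[i] > out[j]:
            out[i], out[j] = out[j], out[i]
    return out

def sorts_all_binary(comparators, n):
    """Check the network on every 0-1 input of length n."""
    return all(apply_network(comparators, bits) == sorted(bits)
               for bits in product((0, 1), repeat=n))

# Hypothetical examples: a five-comparator network on four lines, and the
# same network with its last comparator removed.
GOOD = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]
BAD = GOOD[:-1]

def threshold(t):
    """Monotone map h with h(x) = 0 if x <= t and h(x) = 1 if x > t."""
    return lambda x: 0 if x <= t else 1

if __name__ == "__main__":
    print(sorts_all_binary(GOOD, 4), sorts_all_binary(BAD, 4))
    # Applying h before the network gives the same result as applying it after.
    h = threshold(2)
    a = [4, 1, 3, 2]
    print(apply_network(BAD, [h(v) for v in a]) ==
          [h(v) for v in apply_network(BAD, a)])
```

Because `GOOD` passes the binary check, the theorem guarantees that it sorts every input; the test over all permutations merely confirms this for one value set.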

We now show that Batcher's bitonic merging circuit correctly merges two sorted lists. If 
a correct m-sorter exists, then a correct 2m-sorter can be constructed by combining two m- 
sorters with a correct 2m-input bitonic merging circuit. It follows that a correct 2m-input 
bitonic merging circuit exists if and only if the resulting sorting network is correct. This is 
the core idea in a proof by induction of correctness of the 2m-input bitonic merging circuit. 
The basis for induction is the fact that individual comparators correctly sort sequences of two 
elements. 

Suppose that x and y are sorted 0-1 sequences of length m. Let x have k 0's and
$m - k$ 1's, and let y have l 0's and $m - l$ 1's. Then the leftmost merging network of Fig. 6.11
selects exactly $\lceil k/2 \rceil$ 0's from x and $\lfloor l/2 \rfloor$ 0's from y to produce the sequence u consisting of
$a = \lceil k/2 \rceil + \lfloor l/2 \rfloor$ 0's followed by 1's. Similarly, the rightmost merging network produces
the sequence v consisting of $b = \lfloor k/2 \rfloor + \lceil l/2 \rceil$ 0's followed by 1's. Since $\lceil x \rceil - \lfloor x \rfloor$ is 0 or
1, it follows that either $a = b$, $a = b - 1$, or $a = b + 1$. Thus, when u and v are interleaved
to produce the sequence z it contains a sequence of $a + b$ 0's followed by 1's when $a = b$ or
$a = b + 1$, or $2a$ 0's followed by 1, 0 followed by 1's when $a = b - 1$, as suggested below:

$$z = \underbrace{0, 0, \ldots, 0}_{2a}, 1, 0, 1, \ldots, 1$$

Thus, if for each $0 \le k \le m - 1$ the outputs in positions $2k$ and $2k + 1$ are compared and
swapped, if necessary, the output will be properly sorted.
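
The merging step just analyzed can be sketched as a short recursive function. This is an illustration under assumptions, not the book's construction verbatim: positions are numbered from 0 so that the even-indexed half of x contains $\lceil k/2 \rceil$ of its 0's, m is assumed to be a power of two, and the name `bm_merge` is hypothetical.

```python
def bm_merge(x, y):
    """Merge two sorted lists of equal length m (a power of two).

    Assumption: 0-based positions. Only fixed compare-and-swap steps are
    used, so the function behaves like a comparator network.
    """
    m = len(x)
    if m == 1:                       # a single comparator
        return [min(x[0], y[0]), max(x[0], y[0])]
    u = bm_merge(x[0::2], y[1::2])   # even-indexed x with odd-indexed y
    v = bm_merge(x[1::2], y[0::2])   # odd-indexed x with even-indexed y
    z = [None] * (2 * m)
    z[0::2] = u                      # interleave u and v
    z[1::2] = v
    for k in range(m):               # compare positions 2k and 2k + 1
        if z[2 * k] > z[2 * k + 1]:
            z[2 * k], z[2 * k + 1] = z[2 * k + 1], z[2 * k]
    return z

if __name__ == "__main__":
    print(bm_merge([1, 4, 6, 7], [2, 3, 5, 8]))
```

Checking every pair of sorted 0-1 sequences, as the argument above does, is enough to establish correctness on arbitrary sorted inputs by the zero-one principle applied to merging.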

The graph of BM(4) of Fig. 6.11 illustrates that BM(4) is constructed of two copies of
BM(2). In addition, it demonstrates that the operations of each of the two BM(2) subnetworks
can be performed in parallel. Another important observation is that this graph is iso-
morphic to an FFT graph when the comparators are replaced by two-input butterfly graphs, 
as shown in Fig. 6.12. 

THEOREM 6.8.2 Batcher's 2n-input bitonic merging circuit BM(n) for merging two sorted
n-sequences, $n = 2^k$, has the following size and depth bounds over the basis $\Omega$ of comparators:

$$C_\Omega(BM(n)) \le n(\log n + 1)$$

$$D_\Omega(BM(n)) \le \log n + 1$$

Proof Let $C(k)$ and $D(k)$ be the size and depth of BM(n) for $n = 2^k$. Then $C(0) = 1$, $D(0) = 1$,
$C(k) = 2C(k-1) + 2^k$, and $D(k) = D(k-1) + 1$. It follows that $C(k) = (k+1)2^k$
and $D(k) = k + 1$. (See Problem 6.29.) ■
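
As a sanity check on these bounds, the merging network can be built explicitly on numbered wires and its comparators and levels counted. The sketch below makes several assumptions that are not in the text: wires are numbered from 0, a comparator (i, j) leaves the smaller value on wire i, and the names `bm_network`, `network_depth`, and `run_network` are hypothetical.

```python
def bm_network(xs, ys):
    """Comparators and output wire order for merging sorted wires xs and ys.

    Assumption: a comparator (i, j) puts the smaller value on wire i.
    """
    m = len(xs)
    if m == 1:
        return [(xs[0], ys[0])], [xs[0], ys[0]]
    comps_u, u = bm_network(xs[0::2], ys[1::2])
    comps_v, v = bm_network(xs[1::2], ys[0::2])
    z = [w for pair in zip(u, v) for w in pair]      # interleaved wire order
    final = [(z[2 * k], z[2 * k + 1]) for k in range(m)]
    return comps_u + comps_v + final, z

def network_depth(comparators, num_wires):
    """Longest chain of comparators, computed wire by wire."""
    level = [0] * num_wires
    for i, j in comparators:
        level[i] = level[j] = max(level[i], level[j]) + 1
    return max(level)

def run_network(comparators, order, values):
    """Apply the comparators, then read the wires in output order."""
    wires = list(values)
    for i, j in comparators:
        if wires[i] > wires[j]:
            wires[i], wires[j] = wires[j], wires[i]
    return [wires[w] for w in order]

if __name__ == "__main__":
    for k in range(5):
        m = 2 ** k
        comps, order = bm_network(list(range(m)), list(range(m, 2 * m)))
        print(m, len(comps), (k + 1) * 2 ** k, network_depth(comps, 2 * m), k + 1)
```

For each m = 2^k printed, the comparator count can be compared with $(k+1)2^k$ and the depth with $k + 1$.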

This leads us to the recursive construction of a Batcher's bitonic sorting network BS(n)
for sequences of length n, $n = 2^k$. It merges the output of two copies of BS(n/2) using
a copy of Batcher's n-input bitonic merging circuit BM(n/2). The proof of the following 
theorem is left as an exercise. (See Problem 6.28.) 

THEOREM 6.8.3 Batcher's n-input bitonic sorting circuit BS(n) for $n = 2^k$ has the following
size and depth bounds over the basis $\Omega$ of comparators:

$$C_\Omega(BS(n)) = \frac{n}{4}\left[\log^2 n + \log n\right]$$

$$D_\Omega(BS(n)) = \frac{1}{2}\log n\,(\log n + 1)$$

Figure 6.12 The graph resulting from the replacement of comparators in Fig. 6.11 with
two-input butterfly graphs and the permutation of inputs. All edges are directed from left to right.

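
The recursive construction gives simple recurrences for the size and depth of BS(n): two copies of BS(n/2) followed by one copy of BM(n/2). The sketch below evaluates them numerically, taking the size $m(\log m + 1)$ and depth $\log m + 1$ of BM(m) from Theorem 6.8.2 as exact, and compares the results with the closed forms $\frac{n}{4}(\log^2 n + \log n)$ and $\frac{1}{2}\log n(\log n + 1)$. The function names are hypothetical and BS(1) is assumed to be the empty network.

```python
def log2_exact(n):
    """Base-2 logarithm of a power of two, as an integer."""
    return n.bit_length() - 1

def bm_size_depth(m):
    """Size and depth of BM(m), taken from Theorem 6.8.2."""
    k = log2_exact(m)
    return m * (k + 1), k + 1

def bs_size_depth(n):
    """Size and depth of BS(n) obtained from the recursive construction."""
    if n == 1:                        # assumption: a single input needs no comparators
        return 0, 0
    size_half, depth_half = bs_size_depth(n // 2)
    size_merge, depth_merge = bm_size_depth(n // 2)
    return 2 * size_half + size_merge, depth_half + depth_merge

def bs_closed_form(n):
    """Closed forms (n/4)(log^2 n + log n) and (1/2) log n (log n + 1)."""
    k = log2_exact(n)
    return n * (k * k + k) // 4, k * (k + 1) // 2

if __name__ == "__main__":
    for k in range(1, 8):
        n = 2 ** k
        print(n, bs_size_depth(n), bs_closed_form(n))
```

This is only a numerical check for small n; a proof by induction is still what Problem 6.28 asks for.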
6.8.2 Fast Sorting Networks 

Ajtai, Komlós, and Szemerédi [14] have shown the existence of a sorting network (known
as the AKS sorting network) on n inputs whose circuit size and depth are $O(n \log n)$ and
$O(\log n)$, respectively. The question had been open for many years whether such a sorting
network existed. Prior to [14] it was thought that sorting networks required $\Omega(\log^2 n)$ depth.



Problems 

MATHEMATICAL PRELIMINARIES 

6.1 Show that $(\mathbb{Z}, +, *, 0, 1)$ is a commutative ring, where + and * denote integer addition
and multiplication and 0 and 1 denote the first two integers.

6.2 Let $\mathbb{Z}_p$ be the set of integers modulo p, p > 0, under addition and multiplication
modulo p with additive identity 0 and multiplicative identity 1. Show that $\mathbb{Z}_p$ is a ring.

6.3 A field $\mathcal{F}$ is a commutative ring in which each element other than 0 has a multiplicative
inverse. Show that $(\mathbb{Z}_p, +, *, 0, 1)$ is a field when p is a prime.

MATRICES 

6.4 Let $M_{n \times n}$ be the set of $n \times n$ matrices over a ring $\mathcal{R}$. Show that $(M_{n \times n}, +_n, \times_n, 0_n, I_n)$
is a ring, where $+_n$ and $\times_n$ are the matrix addition and multiplication operators
and $0_n$ and $I_n$ are the $n \times n$ zero and identity matrices.

6.5 Show that the maximum number of linearly independent rows and of linearly independent
columns of an $n \times m$ matrix A over a field are the same.

Hint: Use the fact that permuting the rows and/or columns of A and adding a scalar 
product of one row (column) of A to any other row (column) does not change its rank. 
Use row and column permutations as well as additions of scalar products to rows and/or 
columns of A to transform A into a matrix that contains the largest possible identity 
matrix in its upper left-hand corner. This is called Gaussian elimination. 

6.6 Show that $(AB)^T = B^T A^T$ for all $m \times n$ matrices A and $n \times p$ matrices B over a
commutative ring $\mathcal{R}$.

MATRIX MULTIPLICATION 

6.7 The standard matrix-vector multiplication algorithm for a general $n \times n$ matrix requires
$O(n^2)$ operations. Show that at most $O(n^{\log_2 3})$ operations are needed when the matrix
is Toeplitz.

Hint: Assume that n is a power of two and treat the matrix as a $2 \times 2$ matrix of
$n/2 \times n/2$ matrices. Also note that only $2n - 1$ values determine all the entries in a
Toeplitz matrix. Thus, the difference between two $n \times n$ Toeplitz matrices does not
require $n^2$ operations.
6.8 Generalize Strassen's matrix multiplication algorithm to matrices that are $m \times m$ for
$m = p2^k$, p and k both integers. Derive bounds on the size and depth of a circuit
realizing this version of the algorithm.

For arbitrary n, show how $n \times n$ matrices can be embedded into $m \times m$ matrices,
$m = p2^k$, so that this new version of the algorithm can be used. Show that upper
bounds of $4.77n^{\log_2 7}$ and $O(\log n)$ on the size and depth of this algorithm can be
obtained.

6.9 Show that Strassen's matrix multiplication algorithm can be used to multiply square 
Boolean matrices by replacing OR by addition modulo n + 1. Derive a bound on the 
size and depth of a circuit to realize this algorithm. 

6.10 Show that, when one of two $n \times n$ Boolean matrices A and B is fixed and known in
advance, A and B can be multiplied by a circuit with $O(n^3/\log n)$ operations and
depth $O(\log n)$ to produce the product C = AB using the information provided
below.

a) Multiplication of A and B is equivalent to n multiplications of A with an $n \times 1$
vector x, a column of B.

b) Since A is a 0-1 matrix, the product Ax consists of sums of variables in x.

c) The product Ax can be further decomposed into the sum $A_1x_1 + A_2x_2 + \cdots + A_kx_k$
where $k = \lceil n/\lceil \log n \rceil \rceil$, $A_j$ is the $n \times \lceil \log n \rceil$ submatrix consisting of
columns $(j - 1)\lceil \log n \rceil + 1$ through $j\lceil \log n \rceil$ of A, and $x_j$ is the jth set of
$\lceil \log n \rceil$ rows (variables) in x.

d) There are at most n distinct sums of $\lceil \log n \rceil$ variables each of which can be formed
in at most $2n$ addition steps, thereby saving a factor of $\lceil \log n \rceil$.

TRANSITIVE CLOSURE 

6.11 Let $A = [a_{i,j}]$, $1 \le i, j \le n$, be a Boolean matrix that is the adjacency matrix of
a directed graph G = (V, E) on n = |V| vertices. Give a proof by induction that
the entry in the rth row and sth column of $A^k = A^{k-1} \times A$ is 1 if there is a path
containing k edges from vertex r to vertex s and 0 otherwise.

6.12 Consider a directed graph G = (V, E) in which each edge carries a label drawn from 
a semiring. Let the entry in the ith row and jth column of the adjacency matrix of G
contain the label of the edge between vertices i and j if there is such an edge and the 
empty set otherwise. Assume that the labels of edges in G are drawn from a semiring. 
Show that Theorems 6.4.1, 6.4.2, and 6.4.3 hold for such labeled graphs. 

MATRIX INVERSION 

6.13 Show that over fields the following properties hold for matrix inversion: 

a) The inverse of a 2 x 2 upper (lower) triangular matrix is the matrix with the off- 
diagonal term negated. 

b) The inverse of a 2x2 diagonal matrix is a diagonal matrix in which the ith diagonal 
element is the multiplicative inverse of the ith diagonal element of the original 
matrix. 
6.14 Show that a lower triangular Toeplitz matrix T can be inverted by a circuit of size
$O(n \log n)$ and depth $O(\log^2 n)$.

Hint: Assume that $n = 2^k$, write T as a $2 \times 2$ matrix of $n/2 \times n/2$ matrices, and
devise a recursive algorithm to invert T. 

6.15 Exhibit a circuit to compute the characteristic polynomial $\phi_A(x)$ of an $n \times n$ matrix
A over a field $\mathcal{R}$ that has $O(\max(n^3, \sqrt{n}\,M_{\mathrm{matrix}}(n)))$ field operations and depth
$O(\log^2 n)$.

Hint: Consider the case $n = k^2$. Represent the integer i, $0 \le i \le n - 1$, by the unique
pair of integers (r, s), $0 \le r, s \le k - 1$, where $i = rk + s$. Represent the coefficient
$c_{i+1}$, $0 \le i \le n - 2$, of $\phi_A(x)$ by $c_{r,s}$. Then we can write $\phi_A(x)$ as follows:

fc-i /fe-i \ 

r=0 \s=0 / 

Show that it suffices to perform $k^2n^2 = n^3$ scalar multiplications and $k(k-1)n^2 \le n^3$
additions to form the inner sums, k multiplications of $n \times n$ matrices, and $kn^2$ scalar
additions to combine these products. In addition, $A^2, A^3, \ldots, A^{k-1}$ and $A^k, A^{2k}, \ldots, A^{(k-1)k}$
must be computed.

6.16 Show that the traces of powers, $s_r$, $1 \le r \le n$, for an $n \times n$ matrix A over a field can
be computed with $O(\sqrt{n}\,M_{\mathrm{matrix}}(n))$ operations.

Hint: By definition $s_r = \sum_{j=1}^{n} a_{jj}^{(r)}$, where $a_{jj}^{(r)}$ is the jth diagonal term of the matrix
$A^r$. Let n be a square. Represent r uniquely by a pair (a, b), where $1 \le a, b \le \sqrt{n} - 1$
and $r = a\sqrt{n} + b$. Then $A^r = A^{a\sqrt{n}}A^b$. Thus, $a_{jj}^{(r)}$ can be computed as the product
of the jth row of $A^{a\sqrt{n}}$ with the jth column of $A^b$. Then, for each j, $1 \le j \le n$,
form the $\sqrt{n} \times n$ matrix $R_j$ whose ath row is the jth row of $A^{a\sqrt{n}}$, $0 \le a \le \sqrt{n} - 1$.
Also form the $n \times \sqrt{n}$ matrix $C_j$ whose bth column is the jth column of $A^b$, $1 \le b \le \sqrt{n} - 1$.
Show that the product $R_jC_j$ contains each of the terms $a_{jj}^{(r)}$ for all values
of r, $0 \le r \le n - 1$, and that the products $R_jC_j$, $1 \le j \le n$, can be computed
efficiently.

6.17 Show that (6.14) holds by applying the properties of the coefficients of the characteristic 
polynomial of an n x n matrix stated in (6.15). 

Hint: Use proof by induction on I to establish (6.14). 

CONVOLUTION 

6.18 Consider the convolution $f_{\mathrm{conv}} : R^{n+m} \mapsto R^{n+m-2}$ of an n-tuple a with an
m-tuple b when $n \ll m$. Develop a circuit for this problem whose size is $O(m \log n)$
that uses the convolution theorem multiple times.

Hint: Represent the m-tuple b as a sequence of $\lceil m/n \rceil$ n-tuples.

6.19 The wrapped convolution $f_{\mathrm{wrapped}}^{(n)} : R^{2n} \mapsto R^n$ maps n-tuples $a = (a_0, a_1, \ldots, a_{n-1})$
and $b = (b_0, b_1, \ldots, b_{n-1})$, denoted $a \star b$, to the n-tuple c the jth component
of which, $c_j$, is defined as follows:

$$c_j = \sum_{r+s = j \bmod n} a_r * b_s$$

Show that the wrapped convolution on n-tuples contains the standard convolution on
$\lfloor (n + 1)/2 \rfloor$-tuples as a subfunction and vice versa.

Hint: In both halves of the problem, it helps to characterize the standard and wrapped 
convolutions as matrix-vector products. It is straightforward to show that the wrapped 
convolution contains the standard convolution as a subfunction. To show the other re- 
sult, observe that the matrix characterizing the standard convolution contains a Toeplitz 
matrix as a submatrix. Consider, for example, the standard convolution of two six- 
tuples. The matrix associated with the wrapped convolution contains a special type of 
Toeplitz matrix. 

6.20 Show that the standard convolution function $f_{\mathrm{conv}} : R^{2n} \mapsto R^{2n-2}$ is a subfunction
of the integer multiplication function, $f_{\mathrm{mult}} : B^{2n\lceil \log n \rceil} \mapsto B^{2n\lceil \log n \rceil}$ of Section 2.9
when R is the ring of integers modulo 2.

Hint: Represent the two sequences to be convolved as binary numbers that have been
padded with zeros so that at most one bit in a sequence appears among $\lceil \log n \rceil$
positions.

DISCRETE FOURIER TRANSFORM 

6.21 Let $n = 2^k$. Use proof by induction to show that for all elements a of a commutative
ring $\mathcal{R}$ the following identity holds, where $\prod$ is the product operation:

$$\sum_{j=0}^{n-1} a^j = \prod_{j=0}^{k-1} \left(1 + a^{2^j}\right)$$


6.22 Let $n = 2^k$ and let $\mathcal{R}$ be a commutative ring. For $\omega \in \mathcal{R}$, $\omega \ne 0$, let $m = \omega^{n/2} + 1$.
Show that for $1 \le p < n$

$$\sum_{j=0}^{n-1} \omega^{pj} = 0 \bmod m$$

Hint: Represent p as the product of the largest power of 2 with an odd integer and 
apply the result of Problem 6.21. 

6.23 Let n and $\omega$ be positive powers of two. Let $m = \omega^{n/2} + 1$. Show that in the ring $\mathbb{Z}_m$
of integers modulo m the integer n has a multiplicative inverse and that $\omega$ is a principal
nth root of unity.

6.24 Let n be even. Use the results of Problems 6.21, 6.22, and 6.23 to show that $\mathbb{Z}_m$,
the set of integers modulo m, $m = 2^{tn/2} + 1$ for any positive integer $t \ge 2$, is a
commutative ring in which $\omega = 2^t$ is a principal nth root of unity.

6.25 Let $\omega$ be a principal nth root of unity of the commutative ring $\mathcal{R} = (R, +, *, 0, 1)$.
Show that $\omega^2$ is a principal (n/2)th root of unity.

6.26 A circulant is an $n \times n$ matrix in which the rth row is the rth cyclic shift of the first
row, $2 \le r \le n$. When n is a prime, show that computing the DFT of a vector of
length n is equivalent to multiplying by an $(n - 1) \times (n - 1)$ circulant.

6.27 Show that the multiplication of a circulant matrix with a vector can be done by a circuit
of size $O(n \log n)$ and depth $O(\log n)$.
Figure 6.13 A bitonic sorter on seven inputs.



MERGING AND SORTING 

6.28 Prove Theorem 6.8.3. 

6.29 Show that the recurrences given below and stated in the proof of Theorem 6.8.2 have 
the solutions shown, where C(0) = 1 and D(0) = 1: 



$$C(k) = 2C(k-1) + 2^k = (k+1)2^k$$

$$D(k) = D(k-1) + 1 = k + 1$$



6.30 A sequence $(x_1, x_2, \ldots, x_n)$ is bitonic if there is an integer $0 \le k \le n$ such that
$x_1 > \ldots > x_k < \ldots < x_n$.

a) Show that a bitonic sorting network can be constructed as follows: i) sort $(x_1, x_3, x_5, \ldots)$
and $(x_2, x_4, x_6, \ldots)$ in bitonic sorters whose lines are interleaved, ii)
compare and interchange the outputs in pairs, beginning with the least significant
pairs. (See Fig. 6.13.)

b) Show that two ordered lists can be merged with a bitonic sorter and that an n-sorter 
can be constructed from bitonic sorters. 

c) Determine the number of comparators in a $2^k$-sorter based on merging with bitonic
sorters. 



Chapter Notes 



The bulk of this chapter concerns matrix computations, a topic with a long history. Many 
books have been written on this subject to which the interested reader may refer. (See [25], 
[44], [105], [198], and [362].) 

Among the more important recent results in this area is the matrix multiplication algorithm
of Strassen [319]. Many other improvements have been made on this work, among the
most significant of which is the demonstration by Coppersmith and Winograd [81] that two
$n \times n$ matrices can be multiplied with $O(n^{2.38})$ ring operations.

The relationships between transitive closure and matrix multiplication embodied in Theo- 
rems 6.4.2 and 6.4.3 as well as the generalization of these results to closed semirings are taken 
from the book by Aho, Hopcroft, and Ullman [10]. 
Winograd [364] demonstrated that matrix multiplication is no harder than matrix inver-
sion, whereas Aho, Hopcroft, and Ullman [10] demonstrated the converse. 

Csanky's algorithm for matrix inversion is reported in [82]. Leverrier's method for com- 
puting the characteristic function of a matrix is described in [98]. 

Although the FFT algorithm became well known through the work of Cooley and Tukey 
[80], the idea actually begins with Gauss in 1805! (See Heideman, Johnson, and Burrus [130].) 

The zero-one principle for the study of comparator networks is due to Knuth [170]. Oddly
enough, Batcher's odd-even merging network is due to Batcher [29].

Borodin and Munro [56] is a good early source for arithmetic complexity, the size and 
depth of arithmetic circuits for problems related to matrices and polynomials. More recent 
work on the parallel evaluation of arithmetic circuits is surveyed by JáJá [148, Chapter 8] and
von zur Gathen [111]. 



CHAPTER 7

Parallel Computation



Parallelism takes many forms and appears in many guises. It is exhibited at the CPU level when 
microinstructions are executed simultaneously. It is also present when an arithmetic or logic 
operation is realized by a circuit of small depth, as with carry-save addition. And it is present 
when multiple computers are connected together in a network. Parallelism can be available but 
go unused, either because an application was not designed to exploit parallelism or because a 
problem is inherently serial. 

In this chapter we examine a number of explicitly parallel models of computation, includ- 
ing shared and distributed memory models and, in particular, linear and multidimensional 
arrays, hypercube-based machines, and the PRAM model. We give a broad introduction to 
a large and representative set of models, describing a handful of good parallel programming 
techniques and showing through analysis the limits on parallelization. Because of the limited 
use so far of parallel algorithms and machines, the wide range of hardware and software models 
developed by the research community has not yet been fully digested by the computer industry. 

Parallelism in logic and algebraic circuits is also examined in Chapters 2 and 6. The block 
I/O model, which characterizes parallelism at the disk level, is presented in Section 11.6 and 
the classification of problems by their execution time on parallel machines is discussed in Sec- 
tion 8.15.2. 
7.1 Parallel Computational Models

A parallel computer is any computer that can perform more than one operation at a time.
By this definition almost every computer is a parallel computer. For example, in the pursuit 
of speed, computer architects regularly perform multiple operations in each CPU cycle: they 
execute several microinstructions per cycle and overlap input and output operations (I/O) (see 
Chapter 11) with arithmetic and logical operations. Architects also design parallel computers 
that are either several CPU and memory units attached to a common bus or a collection of 
computers connected together via a network. Clearly parallelism is common in computer 
science today. 

However, several decades of research have shown that exploiting large-scale parallelism is 
very hard. Standard algorithmic techniques and their corresponding data structures do not 
parallelize well, necessitating the development of new methods. In addition, when parallelism 
is sought through the undisciplined coordination of a large number of tasks, the sheer number 
of simultaneous activities to which one human mind must attend can be so large that it is 
often difficult to ensure correctness of a program design. The problems of parallelism are
indeed daunting. 

Small illustrations of this point are seen in Section 2.7.1, which presents an $O(\log n)$-step,
$O(n)$-gate addition circuit that is considerably more complex than the ripple adder given in
Section 2.7. Similarly, the fast matrix inversion straight-line algorithm of Section 6.5.5 is more 
complex than other such algorithms (see Section 6.5). 

In this chapter we examine forms of parallelism that are more coarse-grained than is typ- 
ically found in circuits. We assume that a parallel computer consists of multiple processors 
and memories but that each processor is primarily serial. That is, although a processor may 
realize its instructions with parallel circuits, it typically executes only one or a small number of 
instructions simultaneously. Thus, most of the parallelism exhibited by our parallel computer 
is due to parallel execution by its processors. 

We also describe a few programming styles that encourage a parallel style of programming 
and offer promise for user acceptance. Finally, we present various methods of analysis that 
have proven useful in either determining the parallel time needed for a problem or classifying 
a problem according to its need for parallel time. 

Given the doubling of CPU speed every two or three years, one may ask whether we can't 
just wait until CPU performance catches up with demand. Unfortunately, the appetite for 
speed grows faster than increases in CPU speed alone can meet. Today many problems,
especially those involving simulation of physical systems, require teraflop computers (those
performing $10^{12}$ floating-point operations per second (FLOPS)) but it is predicted that petaflop
computers (performing $10^{15}$ FLOPS) are needed. Achieving such high levels of performance
with a handful of CPUs may require CPU performance beyond what is physically possible at 
reasonable prices. 



7.2 Memoryless Parallel Computers 



The circuit is the premier parallel memoryless computational model: input data passes through
a circuit from inputs to outputs and disappears. A circuit is described by a directed acyclic
graph in which vertices are either input or computational vertices. Input values and the
results of computations are drawn from a set associated with the circuit. (In the case of logic
circuits, these values are drawn from the set B = {0, 1} and are called Boolean.) The function
computed at a vertex is defined through functional composition with values associated with
computational and input vertices on which the vertex depends. Boolean logic circuits are
discussed at length in Chapters 2 and 9. Algebraic and combinatorial circuits are the subject of
Chapter 6. (See Fig. 7.1.)

Figure 7.1 Examples of Boolean and algebraic circuits.

A circuit is a form of unstructured parallel computer. No order or structure is assumed 
on the operations that are performed. (Of course, this does not prevent structure from being 
imposed on a circuit.) Generally circuits are a form of fine-grained parallel computer; that 
is, they typically perform low-level operations, such as AND, OR, or NOT in the case of logic 
circuits, or addition and multiplication in the case of algebraic circuits. However, if the set 
of values on which circuits operate is rich, the corresponding operations can be complex and 
coarse-grained. 

The dataflow computer is a parallel computer designed to simulate a circuit computation. 
It maintains a list of operations and, when all operands of an operation have been computed, 
places that operation on a queue of runnable jobs. 
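
The scheduling rule just described can be sketched in a few lines. This is a minimal sequential illustration under assumptions, not a description of any particular dataflow machine: the circuit encoding, the function name `run_dataflow`, and the full-adder example are all hypothetical.

```python
from collections import deque
import operator

def run_dataflow(inputs, gates):
    """Evaluate a circuit in dataflow order.

    inputs: name -> value for the input vertices.
    gates:  name -> (function, list of operand names) for the other vertices.
    A gate is placed on the ready queue once all its operands are computed.
    """
    values = dict(inputs)
    waiting = {g: sum(1 for a in args if a not in values)
               for g, (_, args) in gates.items()}
    consumers = {}
    for g, (_, args) in gates.items():
        for a in args:
            consumers.setdefault(a, []).append(g)
    ready = deque(g for g, count in waiting.items() if count == 0)
    while ready:
        g = ready.popleft()
        fn, args = gates[g]
        values[g] = fn(*(values[a] for a in args))
        for h in consumers.get(g, []):   # notify gates that use this result
            waiting[h] -= 1
            if waiting[h] == 0:
                ready.append(h)
    return values

# Hypothetical example circuit: a one-bit full adder.
FULL_ADDER = {
    "g1": (operator.xor, ["a", "b"]),
    "sum": (operator.xor, ["g1", "c"]),
    "g2": (operator.and_, ["a", "b"]),
    "g3": (operator.and_, ["g1", "c"]),
    "carry": (operator.or_, ["g2", "g3"]),
}

if __name__ == "__main__":
    out = run_dataflow({"a": 1, "b": 1, "c": 0}, FULL_ADDER)
    print(out["sum"], out["carry"])
```

In a real dataflow computer several entries of the ready queue would be executed at once; the sketch processes them one at a time to keep the bookkeeping visible.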

We now examine a variety of structured computational models, most of which are coarse- 
grained and synchronous. 



7.3 Parallel Computers with Memory 



Many coarse-grained, structured parallel computational models have been developed. In this 
section we introduce these models as well as a variety of performance measures for parallel 
computers. 
There are many ways to characterize parallel computers. A fine-grained parallel computer
is one in which the focus is on its constituent components, which themselves consist of
low-level entities such as logic gates and binary memory cells. A coarse-grained parallel computer
is one in which we ignore the low-level components of the computer and focus instead on its 
functioning at a high level. A complex circuit, such as a carry-lookahead adder, whose details 
are ignored is a single coarse-grained unit, whereas one whose details are studied explicitly is 
fine-grained. CPUs and large memory units are generally viewed as coarse-grained. 

A parallel computer is a collection of interconnected processors (CPUs or memories). The 
processors and the media used to connect them constitute a network. If the processors are 
in close physical proximity and can communicate quickly, we often say that they are tightly 
coupled and call the machine a parallel computer rather than a computer network. How- 
ever, when the processors are not in close proximity or when their operating systems require a 
large amount of time to exchange messages, we say that they are loosely coupled and call the 
machine a computer network. 

Unless a problem is trivially parallel, it must be possible to exchange messages between 
processors. A variety of low-level mechanisms are generally available for this purpose. The use 
of software for the exchange of potentially long messages is called message passing. In a tightly 
coupled parallel computer, messages are prepared, sent, and received quickly relative to the 
clock speed of its processors, but in a loosely coupled parallel computer, the time required for 
these steps is much larger. The time $T_m$ to transmit a message from one processor to another
is generally assumed to be of the form $T_m = \alpha + l\beta$, where $l$ is the length of the message in
words, $\alpha$ (latency) is the time to set up a communication channel, and $\beta$ (bandwidth) is the
time to send and receive one word. Both $\alpha$ and $\beta$ are constant multiples of the duration of
the CPU clock cycle of the processors. Thus, $\alpha + \beta$ is the time to prepare, send, and receive
a single-word message. In a tightly coupled machine $\alpha$ and $\beta$ are small, whereas in a loosely
coupled machine $\alpha$ is large.
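
A small calculation shows why the latency term matters. The sketch below evaluates the message-time model for one long message and for many single-word messages; the function name and the numerical values of latency and per-word time are hypothetical, chosen only for illustration and measured in clock cycles.

```python
def message_time(length_words, alpha, beta):
    """Time alpha + l * beta to send one message of l words."""
    return alpha + length_words * beta

if __name__ == "__main__":
    alpha, beta = 100, 2          # hypothetical latency and per-word time
    words = 1000
    one_long = message_time(words, alpha, beta)
    many_short = words * message_time(1, alpha, beta)
    print(one_long, many_short)   # 2100 versus 102000 cycles
```

Under these assumed values, batching the words into a single message pays the latency once instead of a thousand times, which is why large latency favors fewer, longer messages.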

An important classification of parallel computers with memory is based on the degree to
which they share access to memory. A shared-memory computer is characterized by a model
in which each processor can address locations in a common memory. (See Fig. 7.2(a).) In
this model it is generally assumed that the time to make one access to the common memory
is relatively close to the time for a processor to access one of its registers. Processors in a
shared-memory computer can communicate with one another via the common memory. The
distributed-memory computer is characterized by a model in which processors can communicate
with other processors only by sending messages. (See Fig. 7.2(b).) In this model it is
generally assumed that processors also have local memories and that the time to send a message
from one processor to another can be large relative to the time to access a local memory. A third
type of computer, a cross between the first two, is the distributed shared-memory computer.
It is realized on a distributed-memory computer on which the time to process messages is large
relative to the time to access a local memory, but a layer of software gives the programmer the
illusion of a shared-memory computer. Such a model is useful when programs can be executed
primarily from local memories and only occasionally must access remote memories.

Figure 7.2 (a) A shared-memory computer; (b) a distributed-memory computer.

Parallel computers are synchronous if all processors perform operations in lockstep and 
asynchronous otherwise. A synchronous parallel machine may alternate between executing 
instructions and reading from local or common memory. (See the PRAM model of Sec- 
tion 7.9, which is a synchronous, shared-memory model.) Although a synchronous parallel 
computational model is useful in conveying concepts, in many situations, as with loosely cou- 
pled distributed computers, it is not a realistic one. In other situations, such as in the design 
of VLSI chips, it is realistic. (See, for example, the discussion of systolic arrays in Section 7.5.) 

7.3.1 Flynn's Taxonomy 

Flynn's taxonomy of parallel computers distinguishes between four extreme types of parallel
machine on the basis of the degree of simultaneity in their handling of instructions and
data. The single-instruction, single-data (SISD) model is a serial machine that executes one
instruction per unit time on one data item. An SISD machine is the simplest form of serial
computer. The single-instruction, multiple-data (SIMD) model is a synchronous parallel
machine in which all processors that are not idle execute the same instruction on potentially
different data. (See Fig. 7.3.) The multiple-instruction, single-data (MISD) model describes
a synchronous parallel machine that performs different computations on the same data.
While not yet practical, the MISD machine could be used to test the primality of an integer
(the single datum) by having processors divide it by independent sets of integers. The
multiple-instruction, multiple-data (MIMD) model describes a parallel machine that runs
a potentially different program on potentially different data on each processor but can send
messages among processors.

Figure 7.3 In the SIMD model the same instruction is executed on every processor that is
not idle.

The SIMD machine is generally designed to have a single instruction decoder unit that 
controls the action of each processor, as suggested in Fig. 7.3. SIMD machines have not been a 
commercial success because they require specialized processors rather than today's commodity 
processors that benefit from economies of scale. As a result, most parallel machines today are 
MIMD. Nonetheless, the SIMD style of programming remains appealing because programs 
having a single thread of control are much easier to code and debug. In addition, a MIMD 
model, the more common parallel model in use today, can be programmed in a SIMD style. 

While the MIMD model is often assumed to be much more powerful than the SIMD 
one, we now show that the former can be converted to the latter with at most a constant 
factor slowdown in execution time. Let K be the maximum number of different instructions 
executable by a MIMD machine and index them with integers in the set {1, 2, 3, ... , K}. 
Slow down the computation of each machine by a factor K as follows: 1) identify time intervals
of length K, 2) on the kth step of the jth interval, execute the kth instruction of a processor if
this is the instruction that it would have performed on the jth step of the original computation.
Otherwise, let the processor be idle by executing its NOOP instruction. This construction
executes the instructions of a MIMD computation in a SIMD fashion (all processors either 
are idle or execute the instruction with the same index) with a slowdown by a factor K in 
execution time. 
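
The construction can be made concrete with a short Python sketch. It is illustrative only: the
encoding of a MIMD computation as a table of instruction indices, the names simd_schedule and
is_simd, and the use of 0 as the NOOP marker are assumptions introduced here, not part of the
model's definition.

```python
NOOP = 0  # assumed marker for an idle processor (not from the text)

def simd_schedule(mimd_program, K):
    """mimd_program[j][q] is the index (1..K) of the instruction that
    processor q executes on step j of the MIMD computation.
    Returns K * T SIMD steps, each a list with one entry per processor."""
    schedule = []
    for step in mimd_program:          # the jth interval of length K
        for k in range(1, K + 1):      # the kth step of that interval
            schedule.append([k if instr == k else NOOP for instr in step])
    return schedule

def is_simd(schedule):
    """True if on every step all non-idle processors run the same instruction."""
    return all(len({i for i in step if i != NOOP}) <= 1 for step in schedule)

# Two MIMD steps on three processors, with K = 3 distinct instructions.
mimd = [[1, 3, 2], [2, 2, 1]]
sched = simd_schedule(mimd, K=3)
```

In this example the schedule has 2K = 6 steps, and each processor still executes its original
instructions in their original order, which is the factor-K slowdown described above.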

Although for most machines this simulation is impractical, it does demonstrate that in the 
best case a SIMD program is at worst a constant factor slower than the corresponding MIMD 
program for the same problem. It offers hope that the much simpler SIMD programming style 
can be made close in performance to the more difficult MIMD style. 



7.3.2 The Data-Parallel Model 

The data-parallel model captures the essential features of the SIMD style. It has a single 
thread of control in which serial and parallel operations are intermixed. The parallel opera- 
tions possible typically include vector and shifting operations (see Section 2.5.1), prefix and 
segmented prefix computations (see Section 2.6), and data-movement operations such as are
realized by a permutation network (see Section 7.8.1). They also include conditional vector 
operations, vector operations that are performed on those vector components for which the 
corresponding component of an auxiliary flag vector has value 1 (others have value 0). 

Figure 7.4 shows a data-parallel program for radix sort. This program sorts n d-bit integers,
$\{x[n], \ldots, x[1]\}$, represented in binary. The program makes d passes over the integers.
On each pass the program reorders the integers, placing those whose jth least significant bit
(lsb) is 1 ahead of those for which it is 0. This reordering is stable; that is, the previous
ordering among integers with the same jth lsb is retained. After the jth pass, the n integers are
sorted according to their j least significant bits, so that after d passes the list is fully sorted.
The prefix function $P_{+}^{(n)}$ computes the running sum of the jth lsb on the jth pass. Thus, for
k such that $x[k]_j = 1$ (0), $b_k$ ($c_k$) is the number of integers with index k or higher whose
jth lsb is 1 (0). The value of $a_k = b_k x[k]_j + (c_k + b_1)\overline{x[k]_j}$ is $b_k$ or $c_k + b_1$, depending on
whether the jth lsb of $x[k]$ is 1 or 0, respectively. That is, $a_k$ is the index of the location in which
the kth integer is placed after the jth pass.

{ $x[n]_j$ is the jth least significant bit of the nth integer. }
{ After the jth pass, the integers are sorted by their j least significant bits. }
{ Upon completion, the kth location contains the kth largest integer. }

for j := 0 to d - 1
begin
    $(b_n, \ldots, b_1) := P_{+}^{(n)}(x[n]_j, \ldots, x[1]_j)$;
    { $b_k$ is the number of 1's among $x[n]_j, \ldots, x[k]_j$. }
    { $b_1$ is the number of integers whose jth bit is 1. }

    $(c_n, \ldots, c_1) := P_{+}^{(n)}(\overline{x[n]_j}, \ldots, \overline{x[1]_j})$;
    { $c_k$ is the number of 0's among $x[n]_j, \ldots, x[k]_j$. }

    $(a_n, \ldots, a_1) := (b_n x[n]_j + (c_n + b_1)\overline{x[n]_j}, \ldots, b_1 x[1]_j + (c_1 + b_1)\overline{x[1]_j})$;
    { $a_k = b_k x[k]_j + (c_k + b_1)\overline{x[k]_j}$ is the rank of the kth key. }

    $(x[n + 1 - a_n], x[n + 1 - a_{n-1}], \ldots, x[n + 1 - a_1]) := (x[n], x[n - 1], \ldots, x[1])$
    { This operation permutes the integers. }
end

Figure 7.4 A data-parallel radix sorting program to sort n d-bit binary integers that makes two
uses of the prefix function $P_{+}^{(n)}$.
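
The program of Fig. 7.4 can be traced with the following Python sketch. It is a serial emulation
written under stated assumptions: Python lists are indexed from 0, so position k here plays the
role of x[k + 1] in the figure, the helper name suffix_counts is ours, and each vector operation
is emulated with an ordinary loop or list comprehension rather than executed in parallel.

```python
from itertools import accumulate

def suffix_counts(bits):
    """Running sums taken from the highest index down:
    entry k is the number of 1s among bits[k:]."""
    return list(accumulate(reversed(bits)))[::-1]

def radix_sort(xs, d):
    """Sort non-negative d-bit integers with d stable passes, one per bit."""
    xs = list(xs)
    n = len(xs)
    for j in range(d):
        bit = [(x >> j) & 1 for x in xs]
        b = suffix_counts(bit)                    # b[k]: 1s at positions k..n-1
        c = suffix_counts([1 - v for v in bit])   # c[k]: 0s at positions k..n-1
        ones = b[0] if n else 0                   # plays the role of b_1
        # a[k] is the rank of key k on this pass (1 = placed last).
        a = [b[k] if bit[k] else c[k] + ones for k in range(n)]
        out = [None] * n
        for k in range(n):
            out[n - a[k]] = xs[k]                 # x[n + 1 - a_k] := x[k], 0-indexed
        xs = out
    return xs
```

With this indexing, keys whose current bit is 1 move to the high-index end of the list on each
pass, so the list returned by radix_sort is in ascending order by position.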



The data-parallel model is often implemented using the single-program multiple-data 
(SPMD) model. This model allows copies of one program to run on multiple processors with 
potentially different data without requiring that the copies run in synchrony. It also allows 
the copies to synchronize themselves periodically for the transfer of data. A convenient ab- 
straction often used in the data-parallel model that translates nicely to the SPMD model is the 
assumption that a collection of virtual processors is available, one per vector component. An 
operating system then maps these virtual processors to physical ones. This method is effective 
when there are many more virtual processors than real ones so that the time for interprocessor 
communication is amortized. 

7.3.3 Networked Computers 

A networked computer consists of a collection of processors with direct connections between 
them. In this context a processor is a CPU with memory or a sequential machine designed 
to route messages between processors. The graph of a network has a vertex associated with 
each processor and an edge between two connected processors. Properties of the graph of a 
network, such as its size (number of vertices), its diameter (the largest number of edges on 
the shortest path between two vertices), and its bisection width (the smallest number of edges 
between a subgraph and its complement, both of which have about the same size) characterize 
its computational performance. Since a transmission over an edge of a network introduces 
delay, the diameter of a network graph is a crude measure of the worst-case time to transmit
a message between processors. Its bisection width is a measure of the amount of information
that must be transmitted in the network for processors to communicate with their neighbors.

Figure 7.5 Completely balanced (a) and unbalanced (b) trees.

A large variety of networks have been investigated. The graph of a tree network is a tree. 
Many simple tasks, such as computing sums and broadcasting (sending a message from one 
processor to all other processors), can be done on tree networks. Trees are also naturally suited 
to many recursive computations that are characterized by divide-and-conquer strategies, in 
which a problem is divided into a number of like problems of similar size to yield small results 
that can be combined to produce a solution to the original problem. Trees can be completely 
balanced or unbalanced. (See Fig. 7.5.) Balanced trees of fixed degree have a root and a bounded
number of edges associated with each vertex. The diameter of such trees is logarithmic in
the number of vertices. Unbalanced trees can have a diameter that is linear in the number of 
vertices. 

A mesh is a regular graph (see Section 7.5) in which each vertex has the same degree except 
possibly for vertices on its boundary. Meshes are well suited to matrix operations and can be 
used for a large variety of other problems as well. If, as some believe, speed-of-light limitations 
will be an important consideration in constructing fast computers in the future [43], the one-, 
two-, and three-dimensional mesh may very well become the computer organization of choice. 
The diameter of a mesh of dimension d with n vertices is proportional to $n^{1/d}$. It is not as
small as the diameter of a tree but acceptable for tasks for which the cost of communication 
can be amortized over the cost of computation. 

The hypercube (see Section 7.6) is a graph that has one vertex at each corner of a mul- 
tidimensional cube. It is an important conceptual model because it has low (logarithmic) 
diameter, large bisection width, and a connectivity for which it is easy to construct efficient 
parallel algorithms for a large variety of problems. While the hypercube and the tree have sim- 
ilar diameters, the superior connectivity of the hypercube leads to algorithms whose running 
time is generally smaller than on trees. Fortunately, many hypercube-based algorithms can be 
efficiently translated into algorithms for other network graphs, such as meshes. 
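
The diameters quoted for trees, meshes, and hypercubes can be checked on small instances with
the Python sketch below. The graph constructors and the brute-force diameter routine are our
own illustrative helpers, and the sizes used (about 16 vertices each) are arbitrary choices.

```python
from collections import deque

def diameter(adj):
    """Largest breadth-first-search distance between two vertices
    of a connected graph given as an adjacency dictionary."""
    best = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        best = max(best, max(dist.values()))
    return best

def balanced_binary_tree(levels):
    n = 2 ** levels - 1
    adj = {v: [] for v in range(n)}
    for v in range(1, n):
        parent = (v - 1) // 2
        adj[v].append(parent)
        adj[parent].append(v)
    return adj

def path(n):
    """A path, which is the most unbalanced tree on n vertices."""
    return {v: [u for u in (v - 1, v + 1) if 0 <= u < n] for v in range(n)}

def mesh2d(side):
    moves = ((1, 0), (-1, 0), (0, 1), (0, -1))
    return {(r, c): [(r + dr, c + dc) for dr, dc in moves
                     if 0 <= r + dr < side and 0 <= c + dc < side]
            for r in range(side) for c in range(side)}

def hypercube(dim):
    """Vertices are dim-bit integers; neighbors differ in exactly one bit."""
    return {v: [v ^ (1 << i) for i in range(dim)] for v in range(2 ** dim)}
```

On these instances the balanced tree on 15 vertices and the 4 x 4 mesh both have diameter 6, the
16-vertex hypercube has diameter 4, and the 15-vertex path has diameter 14.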

We demonstrate the utility of each of the above models by providing algorithms that are 
naturally suited to them. For example, linear arrays are good at performing matrix-vector 
multiplications and sorting with bubble sort. Two-dimensional meshes are good at matrix- 
matrix multiplication, and can also be used to sort in much less time than linear arrays. The 
hypercube network is very good at solving a variety of problems quickly but is much more 
expensive to realize than linear or two-dimensional meshes because each processor is connected 
to many more other processors.

Figure 7.6 A crossbar connection network. Any two processors can be connected.



In designing parallel algorithms it is often helpful to devise an algorithm for a particular 
parallel machine model, such as a hypercube, and then map the hypercube and the algo- 
rithm with it to the model of the machine on which it will be executed. In doing this, the 
question arises of how efficiently one graph can be embedded into another. This is the graph- 
embedding problem. We provide an introduction to this important question by discussing 
embeddings of one type of machine into another. 

A connection network is a network computer in which all vertices except for peripheral 
vertices are used to route messages. The peripheral vertices are the computers that are con- 
nected by the network. One of the simplest such networks is the crossbar network, in which 
a row of processors is connected to a column of processors via a two-dimensional array of 
switches. (See Fig. 7.6.) The crossbar switch with 2n computational processors has $n^2$ routing
vertices. The butterfly network (see Fig. 7.15) provides a connectivity similar to that of the
crossbar but with many fewer routing vertices. However, not all permutations of the inputs to 
a butterfly can be mapped to its outputs. For this purpose the Benes network (see Fig. 7.20) 
is better suited. It consists of two butterfly graphs with the outputs of one graph connected to 
the outputs of the second and the order of edges of the second reversed. Many other permuta- 
tion networks exist. Designers of connection networks are very concerned with the variety of 
connections that can be made among computational processors, the time to make these con- 
nections, and the number of vertices in the network for the given number of computational 
processors. (See Section 7.8.) 



7.4 The Performance of Parallel Algorithms 

We now examine measures of performance of parallel algorithms. Of these, computation time
is the most important. Since parallel computation time $T_p$ is a function of $p$, the number of
processors used for a computation, we seek a relationship among $p$, $T_p$, and other measures of
the complexity of a problem.

Given a $p$-processor parallel machine that executes $T_p$ steps, in the spirit of Chapter 3, we
can construct a circuit to simulate it. Its size is proportional to $pT_p$, which plays the role of
serial time $T_s$. Similarly, a single-processor RAM of the type used in a $p$-processor parallel
machine but with $p$ times as much memory can simulate an algorithm on the parallel machine
in $p$ times as many steps; it simulates each step of each of the $p$ RAM processors in succession.
This observation provides the following relationship among $p$, $T_p$, and $T_s$ when storage space
for the serial and parallel computations is comparable.

THEOREM 7.4.1 Let $T_s$ be the smallest number of steps needed on a single RAM with storage
capacity S, in bits, to compute a function f. If f can be computed in $T_p$ steps on a network of p
RAM processors, each with storage S/p, then $T_p$ satisfies the following inequality:

$pT_p \ge T_s$ (7.1)

Proof This result follows because, while the serial RAM can simulate the parallel machine
in $pT_p$ steps, it may be able to compute the function in question more quickly. ■

The speedup S of a parallel p-processor algorithm over the best serial algorithm for a problem
is defined as $S = T_s/T_p$. We see that, with p processors, a speedup of at most p is possible;
that is, $S \le p$. This result can also be stated in terms of the computational work done by serial
and parallel machines, defined as the number of equivalent serial operations. (Computational
work is defined in terms of the equivalent number of gate operations in Section 3.1.2. The
two measures differ only in terms of the units in which work is measured, CPU operations in
this section and gate operations in Section 3.1.2.) The computational work $W_p$ done by an
algorithm on a p-processor RAM machine is $W_p = pT_p$. The above theorem says that the
minimal parallel work needed to compute a function is at least the serial work required for it,
that is, $W_p \ge W_s = T_s$. (Note that we compare the work on a serial processor to a collection
of p identical processors, so that we need not take into account differences among processors.)

A parallel algorithm is efficient if the work that it does is close to the work done by the 
best serial algorithm. A parallel algorithm is fast if it achieves a nearly maximal speedup. We 
leave unspecified just how close to optimal a parallel algorithm must be for it to be classified as 
efficient or fast. This will often be determined by context. We observe that parallel algorithms 
may be useful if they complete a task with acceptable losses in efficiency or speed, even if they 
are not optimal by either measure. 

7.4.1 Amdahl's Law

As a warning that it is not always possible with p processors to obtain a speedup of p, we
introduce Amdahl's Law, which provides an intuitive justification for the difficulty of parallelizing
some tasks. In Sections 3.9 and 8.9 we provide concrete information on the difficulty of par- 
allelizing individual problems by introducing the P-complete problems, problems that are the 
hardest polynomial-time problems to parallelize. 

THEOREM 7.4.2 (Amdahl's Law) Let f be the fraction of a program's execution time on a serial
RAM that is parallelizable. Then the speedup S achievable by this program on a p-processor RAM
machine must satisfy the following bound:

$S \le \frac{1}{(1 - f) + f/p}$

Proof Given a $T_s$-step serial computation, $fT_s/p$ is the smallest possible number of steps
on a p-processor machine for the parallelizable serial steps. The remaining $(1 - f)T_s$ serial
steps take at least the same number of steps on the parallel machine. Thus, the parallel time
$T_p$ satisfies $T_p \ge T_s[(1 - f) + f/p]$ from which the result follows. ■

This result shows that if a fixed fraction $f$ of a program's serial execution time can be
parallelized, the speedup achievable with that program on a parallel machine is bounded above
by $1/(1 - f)$ as p grows without limit. For example, if 90% of the time of a serial program
can be parallelized, the maximal achievable speedup is 10, regardless of the number of parallel
processors available.
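
The bound of Theorem 7.4.2 is easy to tabulate. The following Python sketch evaluates it for the
90% example; the function name amdahl_bound and the processor counts shown are our own
choices for illustration.

```python
def amdahl_bound(f, p):
    """Upper bound on speedup when a fraction f of the serial time
    is parallelizable and p processors are available."""
    return 1.0 / ((1.0 - f) + f / p)

# The bound rises with p but stays below 1 / (1 - f) = 10 when f = 0.9.
table = [(p, amdahl_bound(0.9, p)) for p in (1, 10, 100, 1000, 10**6)]
```

Even with a million processors the bound in this table remains below 10, which is the limiting
value $1/(1 - f)$ discussed above.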

While this statement seems to explain the difficulty of parallelizing certain algorithms, it 
should be noted that programs for serial and parallel machines are generally very different. 
Thus, it is not reasonable to expect that analysis of a serial program should lead to bounds on 
the running time of a parallel program for the same problem. 

7.4.2 Brent's Principle 

We now describe how to convert the inherent parallelism of a problem into an efficient parallel 
algorithm. Brent's principle, stated in Theorem 7.4.3, provides a general schema for exploiting 
parallelism in a problem. 

THEOREM 7.4.3 Consider a computation C that can be done in t parallel steps when the time
to communicate between operations can be ignored. Let $m_i$ be the number of primitive operations
done on the ith step and let $m = \sum_{i=1}^{t} m_i$. Consider a p-processor machine M capable of the
same primitive operations, where $p \le \max_i m_i$. If the communication time between the operations
in C on M can be ignored, the same computation can be performed in $T_p$ steps on M, where $T_p$
satisfies the following bound:

$T_p \le (m/p) + t$

Proof A parallel step in which $m_i$ operations are performed can be simulated by M in
$\lceil m_i/p \rceil \le (m_i/p) + 1$ steps, from which the result follows. ■

Brent's principle provides a schema for realizing the inherent parallelism in a problem. 
However, it is important to note that the time for communication between operations can 
be a serious impediment to the efficient implementation of a problem on a parallel machine. 
Often, the time to route messages between operations can be the most important limitation 
on exploitation of parallelism. 
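
The counting argument in the proof of Theorem 7.4.3 can be checked numerically. The Python
sketch below is a minimal illustration that ignores communication, as the theorem does; the
function names and the sample operation counts are assumptions made for the example.

```python
def simulated_steps(ops_per_step, p):
    """Steps used by p processors when step i of the original computation
    consists of ops_per_step[i] independent operations."""
    return sum(-(-m_i // p) for m_i in ops_per_step)   # sum of ceil(m_i / p)

def brent_bound(ops_per_step, p):
    """The bound (m / p) + t of Theorem 7.4.3."""
    return sum(ops_per_step) / p + len(ops_per_step)

# A 16-input summation tree: t = 4 parallel steps and m = 15 additions.
ops = [8, 4, 2, 1]
```

With p = 3 processors the simulation of this example takes 3 + 2 + 1 + 1 = 7 steps, which is
within the bound 15/3 + 4 = 9.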

We illustrate Brent's principle with the problem of adding n integers, $x_1, \ldots, x_n$, $n = 2^k$.
Under the assumption that at most two integers can be added in one primitive operation, we
see that the sum can be formed by performing n/2 additions, n/4 additions of these results,
etc., until the last sum is formed. Thus, $m_i = n/2^i$ for $i \le \lceil \log_2 n \rceil$. When only p processors
are available, we assign $\lceil n/p \rceil$ integers to p - 1 processors and $n - (p - 1)\lceil n/p \rceil$ integers to the
remaining processor. In $\lceil n/p \rceil$ steps, the p processors each compute their local sums, leaving
their results in a reserved location. In each subsequent phase, half of the processors active in the
preceding phase are active in this one. Each active processor fetches the partial sum computed
by one other processor, adds it to its partial sum, and stores the result in a reserved place. After
$O(\log p)$ phases, the sum of the n integers has been computed. This algorithm computes the
sum of the n integers in $O(n/p + \log p)$ time steps. Since the maximal speedup possible is
p, this algorithm is optimal to within a constant multiplicative factor if $\log p \le (n/p)$ or
$p \le n/\log n$.

It is important to note that the time to communicate between processors is often very
large relative to the length of a CPU cycle. Thus, the assumption that it takes zero time to
communicate between processors, the basis of Brent's principle, holds only for tightly coupled
processors.
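
The summation algorithm described above can be emulated serially to count its steps. In the
following Python sketch the function name parallel_sum, the way blocks are assigned to
processors, and the pairing of processors in each combining phase are our assumptions; the
text fixes only the block sizes and the halving of active processors, and communication is
taken to be free.

```python
def parallel_sum(xs, p):
    """Emulate p processors summing xs; return (sum, number of time steps)."""
    n = len(xs)
    chunk = -(-n // p)                         # ceil(n / p) integers per processor
    partial = [sum(xs[q * chunk:(q + 1) * chunk]) for q in range(p)]
    steps = chunk                              # the text charges ceil(n / p) steps here
    active = p
    while active > 1:                          # combining phases
        half = -(-active // 2)
        for q in range(active - half):         # these additions occur in one step
            partial[q] += partial[q + half]
        active = half
        steps += 1
    return partial[0], steps

total, steps = parallel_sum(list(range(1, 17)), p=4)
```

For n = 16 and p = 4 the emulation uses 4 steps for the local sums and 2 combining phases, in
line with the $O(n/p + \log p)$ estimate.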

7.5 Multidimensional Meshes 

In this section we examine multidimensional meshes. A one-dimensional mesh or linear
array of processors is a one-dimensional (1D) array of computing elements connected via
nearest-neighbor connections. (See Fig. 7.7.) If the vertices of the array are indexed with
integers from the set $\{1, 2, 3, \ldots, n\}$, then vertex $i$, $2 \le i \le n - 1$, is connected to vertices
$i - 1$ and $i + 1$. If the linear array is a ring, vertices 1 and n are also connected. Such an
end-to-end connection can be made with short connections by folding the linear array about
its midpoint.

The linear array is an important model that finds application in very large-scale integrated 
(VLSI) circuits. When the processors of a linear array operate in synchrony (which is the 
usual way in which they are used), it is called a linear systolic array (a systole is a recurrent 
rhythmic contraction, especially of the heart muscle). A systolic array is any mesh (typically 
1D or 2D) in which the processors operate in synchrony. The computing elements of a systolic
array are called cells. A linear systolic array that convolves two binary sequences is described 
in Section 1.6. 

A multidimensional mesh (see Fig. 7.8) (or mesh) offers better connectivity between pro- 
cessors than a linear array. As a consequence, a multidimensional mesh generally can compute 
functions more quickly than a 1D one. We illustrate this point by matrix multiplication on
2D meshes in Section 7.5.3. 

Figure 7.8 shows 2D and 3D meshes. Each vertex of the 2D mesh is numbered by a pair
$(r, c)$, where $0 \le r \le n - 1$ and $0 \le c \le n - 1$ are its row and column indices. (If the cell
$(r, c)$ is associated with the integer $rn + c$, this is the row-major order of the cells. Cells are
numbered left-to-right from 0 to 3 in the first row, 4 to 7 in the second, 8 to 11 in the third,
and 12 to 15 in the fourth.) Vertex $(r, c)$ is adjacent to vertices $(r - 1, c)$ and $(r + 1, c)$ for
$1 \le r \le n - 2$. Similarly, vertex $(r, c)$ is adjacent to vertices $(r, c - 1)$ and $(r, c + 1)$ for
$1 \le c \le n - 2$. Vertices on the boundaries may or may not be connected to other boundary
vertices, and may be connected in a variety of ways. For example, vertices in the first row
(column) can be connected to those in the last row (column) in the same column (row) (this is
a toroidal mesh) or the next larger column (row). The second type of connection is associated
with the dashed lines in Fig. 7.8(a).

Figure 7.7 A linear array to compute the matrix-vector product Ax, where $A = [a_{i,j}]$ and
$x = (x_1, \ldots, x_n)$. On each cycle, the ith processor sets its current sum, $S_i$, to the sum to its
right, $S_{i+1}$, plus the product of its local value, $x_i$, with its vertical input.

Figure 7.8 (a) A two-dimensional mesh with optional connections between the boundary
elements shown by dashed lines. (b) A 3D mesh (a cube) in which elements are shown as subcubes.

Each vertex in a 3D mesh is indexed by a triple $(x, y, z)$, $0 \le x, y, z \le n - 1$, as suggested
in Fig. 7.8(b). Connections between boundary vertices, if any, can be made in a variety of 
ways. Meshes with larger dimensionality are defined in a similar fashion. 

A d-dimensional mesh consists of processors indexed by a d-tuple $(n_1, n_2, \ldots, n_d)$ in
which $0 \le n_j \le N_j - 1$ for $1 \le j \le d$. If processors $(n_1, n_2, \ldots, n_d)$ and $(m_1, m_2, \ldots, m_d)$
are adjacent, there is some $j$ such that $n_i = m_i$ for $i \ne j$ and $|n_j - m_j| = 1$. There may also
be connections between boundary processors, that is, processors for which one component of
their index has either its minimum or maximum value.
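
The adjacency rule for a d-dimensional mesh translates directly into code. The Python sketch
below covers only the nearest-neighbor connections; optional connections between boundary
processors are not modeled, and the function names are ours.

```python
def is_adjacent(n_idx, m_idx):
    """True if the two index tuples agree in every component except one,
    in which they differ by exactly 1."""
    if len(n_idx) != len(m_idx):
        return False
    return sum(abs(a - b) for a, b in zip(n_idx, m_idx)) == 1

def neighbors(idx, sizes):
    """Nearest neighbors of processor idx in a mesh whose jth index
    ranges over 0 .. sizes[j] - 1."""
    result = []
    for j, (n_j, N_j) in enumerate(zip(idx, sizes)):
        for step in (-1, 1):
            if 0 <= n_j + step <= N_j - 1:
                result.append(idx[:j] + (n_j + step,) + idx[j + 1:])
    return result
```

An interior processor of a d-dimensional mesh has 2d neighbors under this rule, while a corner
processor has only d.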

7.5.1 Matrix-Vector Multiplication on a Linear Array

As suggested in Fig. 7.7, the cells in a systolic array can have external as well as nearest-neighbor
connections. This systolic array computes the matrix-vector product $Ax$ of an $n \times n$ matrix
with an n-vector. (In the figure, n = 3.) The cells of the systolic array beat in a rhythmic
fashion. The ith processor sets its current sum, $S_i$, to the product of $x_i$ with its vertical input
plus the value of $S_{i+1}$ to its right (the value 0 is read by the rightmost cell). Initially, $S_i = 0$ for
$1 \le i \le n$. Since alternating vertical inputs are 0, the alternating values of $S_i$ are 0. In Fig. 7.7
the successive values of $S_3$ are $A_{1,3}x_3$, 0, $A_{2,3}x_3$, 0, $A_{3,3}x_3$, 0, 0. The successive values of $S_2$
are 0, $A_{1,2}x_2 + A_{1,3}x_3$, 0, $A_{2,2}x_2 + A_{2,3}x_3$, 0, $A_{3,2}x_2 + A_{3,3}x_3$, 0. The successive values of $S_1$
are 0, 0, $A_{1,1}x_1 + A_{1,2}x_2 + A_{1,3}x_3$, 0, $A_{2,1}x_1 + A_{2,2}x_2 + A_{2,3}x_3$, 0, $A_{3,1}x_1 + A_{3,2}x_2 + A_{3,3}x_3$.

The algorithm described above to compute the matrix-vector product for a $3 \times 3$ matrix
clearly extends to arbitrary $n \times n$ matrices. (See Problem 7.8.) Since the last element of an
$n \times n$ matrix arrives at the array after $3n - 2$ time steps, such an array will complete its task in
$3n - 1$ time steps. A lower bound on the time for this problem (see Problem 7.9) can be derived
by showing that the $n^2$ entries of the matrix $A$ and the $n$ entries of the vector $x$ must be read
to compute $Ax$ correctly by an algorithm, whether serial or not. By Theorem 7.4.1 it follows
that all systolic algorithms using n processors require n steps. Thus, the above algorithm is
optimal to within a constant multiplicative factor.

THEOREM 7.5.1 There exists a linear systolic array with n cells that computes the product of an
n x n matrix with an n-vector in 3n — 1 steps, and no algorithm on such an array can do this 
computation in fewer than n steps. 

Since the product of two n x n matrices can be realized as n matrix-vector products with 
an n x n matrix, an n-processor systolic array exists that can multiply two matrices nearly 
optimally. 
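
The behavior of the linear array of Fig. 7.7 can be emulated cycle by cycle. The Python sketch
below is an illustration under assumptions: the arrival times of the matrix entries are inferred
from the successive values of the sums listed in Section 7.5.1 (row r reaches cell i on cycle
2r + n - i - 1 with 1-based r and i), indices in the code are 0-based, and the function name is
ours.

```python
def systolic_matvec(A, x):
    """Emulate an n-cell linear systolic array computing the product Ax."""
    n = len(x)
    S = [0] * (n + 1)            # S[n] is the constant 0 read by the rightmost cell
    y = []
    for t in range(1, 3 * n - 1):              # cycles 1 .. 3n - 2
        new_S = [0] * (n + 1)
        for i in range(n):                     # all cells update simultaneously
            offset = t - 1 - (n - 1 - i)       # skew of the vertical input stream
            if offset >= 0 and offset % 2 == 0 and offset // 2 < n:
                v = A[offset // 2][i]          # a matrix entry arrives
            else:
                v = 0                          # alternate inputs are 0
            new_S[i] = S[i + 1] + x[i] * v
        S = new_S
        if t >= n and (t - n) % 2 == 0:        # leftmost cell holds a finished entry
            y.append(S[0])
    return y
```

In this emulation the last entry of the product appears in the leftmost cell on cycle 3n - 2,
the cycle on which the last matrix element arrives.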

7.5.2 Sorting on Linear Arrays 

A second application of linear systolic arrays is bubble sorting of integers. A sequential version
of the bubble sort algorithm passes over the entries in a tuple $(x_1, x_2, \ldots, x_n)$ from left to
right multiple times. On the first pass it finds the largest element and moves it to the rightmost
position. It applies the same procedure to the first $n - 1$ elements of the resultant list, stopping
when it finds a list containing one element. This sequential procedure takes time proportional
to $n + (n - 1) + (n - 2) + \cdots + 2 + 1 = n(n + 1)/2$.

A parallel version of bubble sort, sometimes called odd-even transposition sort, is naturally
realized on a linear systolic array. The n entries of the array are placed in n cells. Let $c_i$
be the word in the ith cell. We assume that in one unit of time two adjacent cells can read
words stored in each other's memories ($c_i$ and $c_{i+1}$), compare them, and swap them if one ($c_i$)
is larger than the other ($c_{i+1}$). The odd-even transposition sort algorithm executes n stages.
In the even-numbered stages, integers in even-numbered cells are compared with integers in
the next higher numbered cells and swapped, if larger. In the odd-numbered stages, the same
operation is performed on integers in odd-numbered cells. (See Fig. 7.9.) We show that in n
steps the sorting is complete.

THEOREM 7.5.2 Bubble sort of n elements on a linear systolic array can be done in at most n steps.
Every algorithm to sort a list of n elements on a linear systolic array requires at least $n - 1$ steps.
Thus, bubble sort on a linear systolic array is almost optimal. 

Proof To derive the upper bound we use the zero-one principle (see Theorem 6.8.1), which 
states that if a comparator network for inputs over an ordered set A correctly sorts all binary 
inputs, it correctly sorts all inputs. The bubble sort systolic array maps directly to a com- 
parator network because each of its operations is data-independent, that is, oblivious. To 
see that the systolic array correctly sorts binary sequences, consider the position, r, of the 
rightmost 1 in the array.

Figure 7.9 A systolic implementation of bubble sort on a sequence of five items. Underlined
pairs of items are compared and swapped if out of order. The bottom row shows the first set of 
comparisons. 



If r is even, on the first phase of the algorithm this 1 does not move. However, on all 
subsequent phases it moves right until it arrives at its final position. If r is odd, it moves 
right on all phases until it arrives in its final position. Thus by the second step the rightmost 
1 moves right on every step until it arrives at its final position. The second rightmost 1 is 
free to move to the right without being blocked by the first 1 after the second phase. This 
second 1 will move to the right by the third phase and continue to do so until it arrives at 
its final position. In general, the kth rightmost 1 starts moving to the right by the (k + 1)st
phase and continues until it arrives at its final position. It follows that at most n phases are 
needed to sort the 0-1 sequence. By the zero-one principle, the same applies to all sequences. 

To derive the lower bound, assume that the sorted elements are increasing from left to 
right in the linear array. Let the elements initially be placed in decreasing order from left 
to right. Thus, the process of sorting moves the largest element from the leftmost location 
in the array to the rightmost. This requires at least $n - 1$ steps. The same lower bound
holds if some other permutation of the n elements is desired. For example, if the kth largest
element resides in the rightmost cell at the end of the computation, it can reside initially in
the leftmost cell, requiring at least $n - 1$ operations to move to its final position. ■
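
Odd-even transposition sort is short enough to state as code. The Python sketch below emulates
the n stages serially; the comparisons within one stage touch disjoint pairs of cells, so
performing them one after another gives the same result as performing them at once. Cells and
stages are numbered from 0 here, which is our convention rather than the text's.

```python
def odd_even_transposition_sort(cells):
    """Sort by n stages of compare-and-swap between adjacent cells."""
    c = list(cells)
    n = len(c)
    for stage in range(n):
        start = stage % 2                  # even stages use pairs (0,1), (2,3), ...
        for i in range(start, n - 1, 2):   # odd stages use pairs (1,2), (3,4), ...
            if c[i] > c[i + 1]:
                c[i], c[i + 1] = c[i + 1], c[i]
    return c
```

A list in decreasing order is the input used in the lower-bound argument above, since its
largest element must travel across the whole array.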



7.5.3 Matrix Multiplication on a 2D Mesh 

2D systolic arrays are natural structures on which to compute the product $C = A \times B$ of
matrices A and B. (Matrix multiplication is discussed in Section 6.3.) Since $C = A \times B$ can
be realized as n matrix-vector multiplications, C can be computed with n linear arrays. (See
Fig. 7.7.) If the columns of B are stored in successive arrays and the entries of A pass from
one array to the next in one unit of time, the nth array receives the last entry of B after $4n - 2$
time steps. Thus, this 2D systolic array computes $C = A \times B$ in $4n - 1$ steps. Somewhat
more efficient 2D systolic arrays can be designed. We describe one of them below.

Figure 7.10 shows a 2D mesh for matrix multiplication. Each cell of this mesh adds to
its stored value the product of the values arriving from above and from its left. These two values
pass through the cells to those below and to their right, respectively. When the entries of A are
supplied on the left and those of B are supplied from above in the order shown, the cell $C_{i,j}$
computes $c_{i,j}$, the $(i, j)$ entry of the product matrix C. For example, cell $C_{2,3}$ accumulates the
value $c_{2,3} = a_{2,1} b_{1,3} + a_{2,2} b_{2,3} + a_{2,3} b_{3,3}$. After the entries of C have been computed,
they are produced as outputs by shifting the entries of the mesh to one side of the array. When
generalized to $n \times n$ matrices, this systolic array requires $2n - 1$ steps for the last of the matrix
components to enter the array, and another $n - 1$ steps to compute the last entry $c_{n,n}$. An
additional n steps are needed to shift the components of the product matrix out of the array.
Thus, this systolic array performs matrix multiplication in $4n - 2$ steps.
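
The accumulation phase of this mesh can be emulated with two arrays of registers, one for the
values moving right and one for the values moving down. The Python sketch below is illustrative
and rests on an assumed input skew: row i of A is delayed by i steps and column j of B by j
steps, so that matching entries meet in cell (i, j). The order of inputs in Fig. 7.10 may differ
in detail. Indices are 0-based and the final shifting out of C is not emulated.

```python
def systolic_matmul(A, B):
    """Emulate an n x n systolic mesh; return (C, number of steps used)."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    a = [[0] * n for _ in range(n)]   # values traveling to the right
    b = [[0] * n for _ in range(n)]   # values traveling downward
    steps = 3 * n - 2
    for t in range(steps):
        for i in range(n):            # shift row i right, then inject at the left edge
            for j in range(n - 1, 0, -1):
                a[i][j] = a[i][j - 1]
            k = t - i
            a[i][0] = A[i][k] if 0 <= k < n else 0
        for j in range(n):            # shift column j down, then inject at the top
            for i in range(n - 1, 0, -1):
                b[i][j] = b[i - 1][j]
            k = t - j
            b[0][j] = B[k][j] if 0 <= k < n else 0
        for i in range(n):            # every cell multiplies and accumulates
            for j in range(n):
                C[i][j] += a[i][j] * b[i][j]
    return C, steps
```

The emulation uses $3n - 2$ steps, which matches the $2n - 1$ steps for the last input to enter
plus the further $n - 1$ steps to finish $c_{n,n}$ counted in the text.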

We put the following requirements on every systolic array (of any dimension) that com- 
putes the matrix multiplication function: a) each component of each matrix enters the array 
at one location, and b) each component of the product matrix is computed at a unique cell. 
We now show that the systolic matrix multiplication algorithm is optimal to within a constant 
multiplicative factor. 

THEOREM 7.5.3 Two $n \times n$ matrices can be multiplied by an $n \times n$ systolic array in $4n - 2$ steps
and every two-dimensional systolic array for this problem requires at least $(n/2) - 1$ steps.

Proof The proof that two n x n matrices can be multiplied in $4n - 2$ steps by a two- 
dimensional systolic array was given above. We now show that $\Omega(n)$ steps are required to 
multiply two n x n matrices, A and B, to produce the matrix $C = A \times B$. Observe that 
the number of cells in a two-dimensional array that are within d moves from any particular 
cell is at most $\sigma(d)$, where $\sigma(d) = 2d^2 + 2d + 1$. The maximum occurs at the center of the 
array. (See Problem 7.11.) 

Figure 7.10 A two-dimensional mesh for the multiplication of two matrices. The entries in 
these matrices are supplied in successive time intervals to processors on the boundary of the mesh. 

Given a systolic array with inputs supplied externally over time (see Fig. 7.10), we enlarge 
the array so that each component of each matrix is initially placed in a unique cell. The 
enlarged array contains the original n x n array. 
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, it follows that for each value of i, j, t, and 



u there is a path from $a_{i,u}$ to the cell at which $c_{i,j}$ is computed as well as a path from $b_{t,j}$ to 
this same cell. Thus, it follows that there is a path in the array between arbitrary entries $a_{i,u}$ 
and $b_{t,j}$ of the matrices $A = [a_{i,u}]$ and $B = [b_{t,j}]$. Let s be the maximum number of array 
edges between an element of A or B and an element of C on which it depends. It follows 
that at least s steps are needed to form C and that every element of A and B is within 
distance 2s. Furthermore, each of the $2n^2$ elements of A and B is located initially in a unique 
cell of the expanded systolic array. Since there are at most $\sigma(2s)$ vertices within a distance 
of 2s, it follows that $\sigma(2s) = 2(2s)^2 + 2(2s) + 1 \ge 2n^2$, from which we conclude that the 
number of steps to multiply n x n matrices is at least 

$$s \ge \tfrac{1}{2}\left(n^2 - \tfrac{1}{4}\right)^{1/2} - \tfrac{1}{4} \ge \frac{n}{2} - 1.$$
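
The counting step in this proof can be checked numerically. The following sketch, with hypothetical helper names, counts the cells within d moves of a cell in an unbounded grid, compares the count with $\sigma(d) = 2d^2 + 2d + 1$, and finds the smallest s satisfying $\sigma(2s) \ge 2n^2$.

```python
def cells_within(d):
    """Count grid cells reachable in at most d unit moves from a fixed cell."""
    return sum(1
               for x in range(-d, d + 1)
               for y in range(-d, d + 1)
               if abs(x) + abs(y) <= d)


def sigma(d):
    """Closed form used in the proof: 2d^2 + 2d + 1."""
    return 2 * d * d + 2 * d + 1


def min_steps(n):
    """Smallest integer s with sigma(2s) >= 2n^2."""
    s = 0
    while sigma(2 * s) < 2 * n * n:
        s += 1
    return s


if __name__ == "__main__":
    for n in (2, 8, 32):
        print(n, min_steps(n), n / 2 - 1)
```

For every n the smallest admissible s is at least $n/2 - 1$, in agreement with the bound derived above.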

7.5.4 Embedding of 1D Arrays in 2D Meshes 

Given an algorithm for a linear array, we ask whether that algorithm can be efficiently realized 
on a 2D mesh. This is easily determined: we need only specify a mapping of the cells of a linear 
array to cells in the 2D mesh. Assuming that the two arrays have the same number of cells, a 
natural mapping is obtained by giving the cells of an n x n mesh the snake-row ordering. (See 
Fig. 7.11.) In this ordering cells of the first row are ordered from left to right and numbered 
from 0 to $n - 1$; those in the second row are ordered from right to left and numbered from 
n to $2n - 1$. This process repeats, alternating between ordering cells from left to right and 
right to left and numbering the cells in succession. Ordering the cells of a linear array from 
left to right and numbering them from 0 to $n^2 - 1$ allows us to map the linear array directly 
to the 2D mesh. Any algorithm for the linear array runs in the same time on a 2D mesh if the 
processors in the two cases are identical. 

Now we ask if, given an algorithm for a 2D mesh, we can execute it on a linear array. The 
answer is affirmative, although the execution time of the algorithm may be much greater on the 
1D array than on the 2D mesh. As a first step, we map vertices of the 2D mesh onto vertices 
of the 1D array. The snake-row ordering of the cells of an n x n array provides a convenient 
mapping of the cells of the 2D mesh onto the cells of the linear array with $n^2$ cells. We assume 
that each of the cells of the linear array is identical to a cell in the 2D mesh. 

Figure 7.11 Snake-row ordering of the vertices of a two-dimensional mesh. 

We now address the question of communication between cells. When mapped to the 1D 
array, cells can communicate only with their two immediate neighbors in the array. However, 
cells on the n x n mesh can communicate with as many as four neighbors. Unfortunately, cells 
in one row of the 2D mesh that are neighbors of cells in an adjacent row are mapped to cells 
that are as far as $2n - 1$ cells away in the linear array. We show that with a factor of $8n - 2$ 
slowdown, the linear array can simulate the 2D mesh. A slowdown by at least a factor of n/2 
is necessary for those problems and data for which a datum moves from the first to the last 
entry in the array (in $n^2 - 1$ steps) to simulate a movement that takes $2n - 1$ steps on the 
array. ($(n^2 - 1)/(2n - 1) \ge n/2$ for $n \ge 2$.) 

Given an algorithm for a 2D mesh, slow it down as follows: 

a) Subdivide each cycle into six subcycles. 

b) In the first of these subcycles let each cell compute using its local data. 

c) In the second subcycle let each cell communicate with neighbor(s) in adjacent columns. 

d) In the third subcycle let cells in even-numbered rows send messages to cells in the next 
higher numbered rows. 

e) In the fourth subcycle let cells in even-numbered rows receive messages from cells in the 
next higher numbered rows. 

f) In the fifth subcycle let cells in odd-numbered rows send messages to cells in next higher 
numbered rows. 

g) In the sixth subcycle let cells in odd-numbered rows receive messages from cells in next 
higher numbered rows. 

When the revised 2D algorithm is executed on the linear array, computation occurs in the 
first subcycle in unit time. During the second subcycle communication occurs in unit time 
because cells that are column neighbors in the 2D mesh are adjacent in the 1D array. The 
remaining four subcycles involve communication between pairs of groups of n cells each. This 
can be done for all pairs in $2n - 1$ time steps: each cell shifts a datum in the direction of the 
cell for which it is destined. After $2n - 1$ steps it arrives and can be processed. We summarize 
this result below. 

THEOREM 7.5.4 Any T-step systolic algorithm on an n x n array can be simulated on a linear 
systolic array with $n^2$ cells in at most $(8n - 2)T$ steps. 
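
The distances that drive this bound can be computed directly from the snake-row ordering. The sketch below uses hypothetical helper names and 0-based row and column indices; it checks that cells adjacent within a row stay adjacent in the linear array, that cells adjacent across rows are at most $2n - 1$ positions apart, and that two unit-time subcycles plus four subcycles of $2n - 1$ steps give the $8n - 2$ slowdown.

```python
def snake_index(r, c, n):
    """Position in the linear array of mesh cell (r, c) under snake-row ordering."""
    return r * n + (c if r % 2 == 0 else n - 1 - c)


def max_gap_between_rows(n):
    """Largest linear-array distance between cells (r, c) and (r + 1, c)."""
    return max(abs(snake_index(r + 1, c, n) - snake_index(r, c, n))
               for r in range(n - 1) for c in range(n))


def max_gap_within_rows(n):
    """Largest linear-array distance between cells (r, c) and (r, c + 1)."""
    return max(abs(snake_index(r, c + 1, n) - snake_index(r, c, n))
               for r in range(n) for c in range(n - 1))


def slowdown(n):
    """Two unit-time subcycles plus four subcycles of 2n - 1 steps each."""
    return 2 + 4 * (2 * n - 1)


if __name__ == "__main__":
    n = 4
    print(max_gap_within_rows(n), max_gap_between_rows(n), slowdown(n))
```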

In the next section we demonstrate that hypercubes can be embedded into meshes. From 
this result we derive mesh-based algorithms for a variety of problems from hypercube-based 
algorithms for these problems. 



7.6 Hypercube-Based Machines 



A d-dimensional hypercube has $2^d$ vertices. When they are indexed by binary d-tuples 
$(a_{d-1}, a_{d-2}, \ldots, a_0)$, adjacent vertices are those whose tuples differ in one position. Thus, the 2D 
hypercube is a square, the 3D hypercube is the traditional 3-cube, and the four-dimensional 
hypercube consists of two 3-cubes with edges between corresponding pairs of vertices. (See 
Fig. 7.12.) The d-dimensional hypercube is composed of two $(d-1)$-dimensional hypercubes 
in which each vertex in one hypercube has an edge to the corresponding vertex in the other. 
The degree of each vertex in a d-dimensional hypercube is d and its diameter is d as well. 

Figure 7.12 Hypercubes in two, three, and four dimensions. 

While the hypercube is a very useful model for algorithm development, the construction 
of hypercube-based networks can be costly due to the high degree of the vertices. For example, 
each vertex in a hypercube with 4,096 vertices has degree 12; that is, each vertex is connected to 
12 other vertices, and a total of 49,152 connections are necessary among the 4,096 processors. 
By contrast, a $2^6 \times 2^6$ 2D mesh has the same number of processors but at most 16,384 wires. 
The ratio between the number of wires in a d-dimensional hypercube and a square mesh with 
the same number of vertices is d/4. This makes it considerably more difficult to realize a 
hypercube of high dimensionality than a 2D mesh with a comparable number of vertices. 
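
The figures in this comparison can be reproduced with a few lines of arithmetic. The sketch below uses hypothetical helper names and tallies connections per vertex (d per hypercube vertex, at most 4 per mesh vertex), which is the counting that yields 49,152 and 16,384; it also lists the neighbors of a vertex by flipping one address bit at a time.

```python
def neighbors(v, d):
    """Neighbors of vertex v in a d-dimensional hypercube: flip one address bit."""
    return [v ^ (1 << j) for j in range(d)]


def hypercube_connections(d):
    """Connections tallied per vertex: each of the 2^d vertices has degree d."""
    return d * 2 ** d


def mesh_connections_bound(d):
    """Upper bound for a square mesh on 2^d vertices: at most 4 per vertex."""
    return 4 * 2 ** d


if __name__ == "__main__":
    d = 12
    print(hypercube_connections(d), mesh_connections_bound(d))
    print(hypercube_connections(d) / mesh_connections_bound(d), d / 4)
```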

7.6.1 Embedding Arrays in Hypercubes 

Given an algorithm designed for an array, we ask whether it can be efficiently realized on 
a hypercube network. The answer is positive. We show by induction that if d is even, a 
$2^{d/2} \times 2^{d/2}$ array can be embedded into a d-dimensional, $2^d$-vertex hypercube and if d is odd, 
a $2^{(d+1)/2} \times 2^{(d-1)/2}$ array can be embedded into a d-dimensional hypercube. The base cases 
are d = 2 and d = 3. 

Figure 7.13 Mappings of 2 x 2, 4 x 2, and 4 x 4 arrays to two-, three-, and four-dimensional 
hypercubes. The binary tuples identify vertices of a hypercube. 

When d = 2, a $2^{d/2} \times 2^{d/2}$ array is a 2 x 2 array that is itself a four-vertex hypercube. 
When d = 3, a $2^{(d+1)/2} \times 2^{(d-1)/2}$ array is a 4 x 2 array. (See Fig. 7.13, page 299.) It 
can be embedded into a three-dimensional hypercube by mapping the top and bottom 2 x 2 
subarrays to the vertices of the two 2-cubes contained in the 3-cube. The edges between the 
two subarrays correspond directly to edges between vertices of the 2-cubes. 

Applying the same kind of reasoning to the inductive hypothesis, we see that the hypothesis 
holds for all values of $d \ge 2$. If a 2D array is not of the form indicated, it can be embedded 
into such an array whose sides are a power of 2 by at most quadrupling the number of vertices. 
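
One concrete way to obtain such an embedding is with reflected Gray codes. This is a standard construction substituted here for illustration; it is not necessarily the mapping produced by the induction above or drawn in Fig. 7.13. The helper names are hypothetical. Row and column indices are Gray-coded separately and concatenated, so that array neighbors differ in exactly one address bit.

```python
def gray(i):
    """Reflected Gray code of i: consecutive integers differ in exactly one bit."""
    return i ^ (i >> 1)


def embed(row, col, col_bits):
    """Hypercube address of array cell (row, col); columns use the low col_bits bits."""
    return (gray(row) << col_bits) | gray(col)


def is_hypercube_edge(u, v):
    """True when addresses u and v differ in exactly one bit."""
    x = u ^ v
    return x != 0 and x & (x - 1) == 0


def embedding_is_valid(row_bits, col_bits):
    """Check that every pair of adjacent array cells maps to a hypercube edge."""
    rows, cols = 2 ** row_bits, 2 ** col_bits
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols and not is_hypercube_edge(embed(r, c, col_bits),
                                                      embed(r, c + 1, col_bits)):
                return False
            if r + 1 < rows and not is_hypercube_edge(embed(r, c, col_bits),
                                                      embed(r + 1, c, col_bits)):
                return False
    return True


if __name__ == "__main__":
    print(embedding_is_valid(2, 2))   # 4 x 4 array in a 4-dimensional hypercube
    print(embedding_is_valid(2, 1))   # 4 x 2 array in a 3-dimensional hypercube
```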



7.6.2 Cube-Connected Cycles 

A reasonable alternative to the hypercube is the cube-connected cycles (CCC) network shown 
in Fig. 7.14. Each of its vertices has degree 3, yet the graph has a diameter only a constant factor 
larger than that of the hypercube. The (d, r)-CCC is defined in terms of a d-dimensional 
hypercube when $r \ge d$. Let $(a_{d-1}, a_{d-2}, \ldots, a_0)$ and $(b_{d-1}, b_{d-2}, \ldots, b_0)$ be the indices of two 
adjacent vertices on the d-dimensional hypercube. Assume that these tuples differ in the jth 
component, $0 \le j \le d - 1$; that is, $a_j = b_j \oplus 1$ and $a_i = b_i$ for $i \ne j$. Associated with vertex 
$(a_{d-1}, \ldots, a_p, \ldots, a_0)$ of the hypercube are the vertices $(p, a_{d-1}, \ldots, a_p, \ldots, a_0)$, $0 \le p \le r - 1$, 
of the CCC that form a ring; that is, vertex $(p, a_{d-1}, \ldots, a_p, \ldots, a_0)$ is adjacent to 
vertices $((p + 1) \bmod r, a_{d-1}, \ldots, a_p, \ldots, a_0)$ and $((p - 1) \bmod r, a_{d-1}, \ldots, a_p, \ldots, a_0)$. 
In addition, for $0 \le p \le d - 1$, vertex $(p, a_{d-1}, \ldots, a_p, \ldots, a_0)$ is adjacent to vertex 
$(p, a_{d-1}, \ldots, a_p \oplus 1, \ldots, a_0)$, the pth vertex on the ring at the adjacent corner of the hypercube. 

Figure 7.14 The cube-connected cycles network replaces each vertex of a d-dimensional hyper- 
cube with a ring of $r \ge d$ vertices in which each vertex is connected to its neighbor on the ring. 
The jth ring vertex, $0 \le j \le d - 1$, is also connected to the jth ring vertex at an adjacent corner 
of the original hypercube. 

The diameter of the CCC is at most $3r/2 + d$, as we now show. Given two vertices 
$v_1 = (p, a_{d-1}, \ldots, a_0)$ and $v_2 = (q, b_{d-1}, \ldots, b_0)$, let their hypercube addresses 
$a = (a_{d-1}, \ldots, a_0)$ and $b = (b_{d-1}, \ldots, b_0)$ differ in k positions. To move from $v_1$ to $v_2$, move 
along the ring containing $v_1$ by decreasing processor numbers until reaching the next lower 
index at which a and b differ. (Wrap around to the highest index, if necessary.) Move from 
this ring to the ring whose hypercube address differs in this index. Move around this ring until 
arriving at the next lower indexed processor at which a and b differ. Continue in this fashion 
until reaching the ring with hypercube address b. The number of edges traversed in this phase 
of the movement is at most one for each vertex on the ring plus at most one for each of the 
$k \le d$ positions on which the addresses differ. Finally, move around the last ring toward the 
vertex $v_2$ along the shorter path. This requires at most r/2 edge traversals. Thus, the maximal 
distance between two vertices, the diameter of the graph, is at most $3r/2 + d$. 
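
For small parameters the diameter bound can be checked by exhaustive search. The sketch below builds the (d, r)-CCC from the adjacency rule described above (ring edges, plus a hypercube edge at ring position p for $p \le d - 1$) and measures the diameter by breadth-first search from every vertex. The function names are hypothetical, and each vertex is stored as a pair (p, a) with the hypercube address a packed into an integer.

```python
from collections import deque


def ccc_neighbors(p, a, d, r):
    """Neighbors of CCC vertex (p, a): two ring neighbors and, if p < d, one cube neighbor."""
    nbrs = {((p + 1) % r, a), ((p - 1) % r, a)}
    if p < d:
        nbrs.add((p, a ^ (1 << p)))     # complement bit p of the hypercube address
    return nbrs


def ccc_diameter(d, r):
    """Exact diameter of the (d, r)-CCC by breadth-first search from every vertex."""
    vertices = [(p, a) for p in range(r) for a in range(2 ** d)]
    diameter = 0
    for src in vertices:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            v = queue.popleft()
            for w in ccc_neighbors(v[0], v[1], d, r):
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        diameter = max(diameter, max(dist.values()))
    return diameter


if __name__ == "__main__":
    for d, r in ((3, 3), (3, 4), (4, 4)):
        print(d, r, ccc_diameter(d, r), 3 * r / 2 + d)
```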



7.7 Normal Algorithms 



Normal algorithms on hypercubes are systolic algorithms with the property that in each cycle 
some bit position in an address is chosen and data is exchanged only between vertices whose 
addresses differ in this position. An operation is then performed on this data in one or both 
vertices. Thus, if the hypercube has three dimensions and the chosen dimension is the second, 
the following pairs of vertices exchange data and perform operations on them: (0, 0, 0) and 
(0, 1, 0), (0, 0, 1) and (0, 1, 1), (1, 0, 0) and (1, 1, 0), and (1, 0, 1) and (1, 1, 1). A fully 
normal algorithm is a normal algorithm that visits each of the dimensions of the hypercube in 
sequence. There are two kinds of fully normal algorithms, ascending and descending algo- 
rithms; ascending algorithms visit the dimensions of the hypercube in ascending order, whereas 
descending algorithms visit them in descending order. We show that many important algo- 
rithms are fully normal algorithms or combinations of ascending and descending algorithms. 
These algorithms can be efficiently translated into mesh-based algorithms, as we shall see. 
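
The pairing used in one cycle of a normal algorithm is easy to enumerate. In the sketch below (the helper name is hypothetical) addresses are packed into integers and dimension j refers to bit j counted from the least significant end, so the "second" dimension of the three-dimensional example is j = 1.

```python
def exchange_pairs(d, j):
    """Pairs of d-bit addresses that differ only in bit j."""
    return [(a, a | (1 << j)) for a in range(2 ** d) if not (a >> j) & 1]


if __name__ == "__main__":
    for lo, hi in exchange_pairs(3, 1):
        print(format(lo, "03b"), format(hi, "03b"))
```

An ascending algorithm would call this for j = 0, 1, ..., d - 1 in turn, and a descending algorithm for j = d - 1, ..., 0.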

The fast Fourier transform (FFT) (see Section 6.7.3) is an ascending algorithm. As 
suggested in the butterfly graph of Fig. 7.15, if each vertex at each level in the FFT graph on 
$n = 2^d$ inputs is indexed by a pair (l, a), where a is a binary d-tuple and $0 \le l \le d$, then 
at level l pairs of vertices are combined whose indices differ in their lth component. (See 
Problem 7.14.) It follows that the FFT graph can be computed in levels on the d-dimensional 
hypercube by retaining the values corresponding to the column indexed by a in the hypercube 
vertex whose index is a. It follows that the FFT graph has exactly the minimal connectivity 
required to execute an ascending fully normal algorithm. If the directions of all edges 
are reversed, the graph is exactly that needed for a descending fully normal algorithm. (The 
convolution function $f^{(n,m)}_{\mathrm{conv}} : \mathcal{R}^{n+m} \mapsto \mathcal{R}^{n+m-1}$ over a commutative ring $\mathcal{R}$ can also be 
implemented as a normal algorithm in time $O(\log n)$ on an n-vertex hypercube, $n = 2^d$. See 
Problem 7.15.) 

Similarly, because the graph of Batcher's bitonic merging algorithm (see Section 6.8.1) is 
the butterfly graph associated with the FFT, it too is a normal algorithm. Thus, two sorted lists 
of length $n = 2^d$ can be merged in $d = \log_2 n$ steps. As stated below, because the butterfly 
graph on $2^d$ inputs contains butterfly subgraphs on $2^k$ inputs, $k < d$, a recursive normal 
sorting algorithm can be constructed that sorts on the hypercube in $O(\log^2 n)$ steps. The 
reader is asked to prove the following theorem. (See Problem 6.29.) 

Figure 7.15 The FFT butterfly graph with column numberings. The predecessors of vertices 
at the kth level differ in their kth least significant bits. 

THEOREM 7.7.1 There exists a normal sorting algorithm on the p-vertex hypercube, $p = 2^d$, that 
sorts p items in time $O(\log^2 p)$. 

Normal algorithms can also be used to perform a sum on the hypercube and broadcast 
on the hypercube, as we show. We give an ascending algorithm for the first problem and a 
descending algorithm for the second. 



7.7.1 Summing on the Hypercube 



Let the hypercube be d-dimensional and let $a = (a_{d-1}, a_{d-2}, \ldots, a_0)$ denote an address of a 
vertex. Associate with a the integer $|a| = a_{d-1}2^{d-1} + a_{d-2}2^{d-2} + \cdots + a_0$. Thus, when 
d = 3, the addresses {0, 1, 2, . . . , 7} are associated with the eight 3-tuples {(0, 0, 0), (0, 0, 1), 
(0, 1, 0), . . . , (1, 1, 1)}, respectively. 

Let V(|a|) denote the value stored at the vertex with address a. For each $(d - 1)$-tuple 
$(a_{d-1}, \ldots, a_1)$, send to vertex $(a_{d-1}, \ldots, a_1, 0)$ the value stored at vertex $(a_{d-1}, \ldots, a_1, 1)$. 
In the summing problem we store at vertex $(a_{d-1}, \ldots, a_1, 0)$ the sum of the original values 
stored at vertices $(a_{d-1}, \ldots, a_1, 0)$ and $(a_{d-1}, \ldots, a_1, 1)$. Below we show the transmission 
(e.g., V(0) ← V(1)) and addition (e.g., V(0) ← V(0) + V(1)) that result for d = 3: 

V(0) ← V(1), V(0) ← V(0) + V(1) 
V(2) ← V(3), V(2) ← V(2) + V(3) 
V(4) ← V(5), V(4) ← V(4) + V(5) 
V(6) ← V(7), V(6) ← V(6) + V(7) 

For each $(d - 2)$-tuple $(a_{d-1}, \ldots, a_2)$ we then send to vertex $(a_{d-1}, \ldots, a_2, 0, 0)$ the value 
stored at vertex $(a_{d-1}, \ldots, a_2, 1, 0)$. Again for d = 3, we have the following data transfers and 
additions: 






V(0) ← V(2), V(0) ← V(0) + V(2) 
V(4) ← V(6), V(4) ← V(4) + V(6) 

We continue in this fashion until reaching the lowest dimension of the d-tuples at which point 
we have the following actions when d = 3: 

V(0) ← V(4), V(0) ← V(0) + V(4) 

At the end of this computation, V(0) is the sum of the values stored in all vertices. This 
algorithm for computing V(0) can be extended to any associative binary operator. 
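
The stages above can be written as a short loop. This is a sequential sketch of the parallel algorithm with a hypothetical function name: the list `values` plays the role of V, index |a| stands for the vertex with address a, and each pass of the outer loop corresponds to one parallel step across one hypercube dimension.

```python
def hypercube_sum(values, op=lambda x, y: x + y):
    """Combine 2^d values with an associative operator, one dimension per step.

    Returns the combined value left at vertex 0 and the number of parallel steps.
    """
    V = list(values)
    n = len(V)                      # assumed to be a power of two
    stride, steps = 1, 0
    while stride < n:
        # All updates in this inner loop happen in parallel on the hypercube.
        for a in range(0, n, 2 * stride):
            V[a] = op(V[a], V[a + stride])
        stride *= 2
        steps += 1
    return V[0], steps


if __name__ == "__main__":
    print(hypercube_sum(range(8)))                 # sum of 0..7 in three steps
    print(hypercube_sum([3, 9, 4, 1], op=max))     # any associative operator works
```

Because operands are always combined in address order, the operator needs to be associative but not commutative.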

7.7.2 Broadcasting on the Hypercube 

The broadcast operation is obtained by reversing the directions of each of the transmissions 
described above. Thus, in the example, V(0) is sent to V(4) in the first stage, in the second 
stage V(0) and V(4) are sent to V(2) and V(6), respectively, and in the last stage, V(0), 
V(2), V(4), and V(6) are sent to V(1), V(3), V(5), and V(7), respectively. 

The algorithm given above to broadcast from one vertex to all others in a hypercube can be 
modified to broadcast to just the vertices in a subhypercube that is defined by those addresses 
$a = (a_{d-1}, a_{d-2}, \ldots, a_0)$ in which all bits are fixed except for those in some k positions. 
For example, {(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 0)} are the vertices of a subhypercube of the 
three-dimensional hypercube (the rightmost bit is fixed). To broadcast to each of these vertices 
from (0, 1, 0), say, on the first step send the message to its pair along the second dimension, 
namely, (0, 0, 0). On the second step, let these pairs send messages to their pairs along the 
third dimension, namely, (0, 1, 0) → (1, 1, 0) and (0, 0, 0) → (1, 0, 0). This algorithm can be 
generalized to broadcast from any vertex in a hypercube to all other vertices in a subhypercube. 
Values at all vertices of a subhypercube can be associatively combined in a similar fashion. 

The performance of these normal algorithms is summarized below. 

THEOREM 7.7.2 Broadcasting from one vertex in a d-dimensional hypercube to all other vertices 
can be done with a normal algorithm in O(d) steps. Similarly, the associative combination of the 
values stored at the vertices of a d-dimensional hypercube can be done with a normal algorithm 
in O(d) steps. Broadcasting and associative combining can also be done on the vertices of a 
k-dimensional subcube of the d-dimensional hypercube in O(k) steps with a normal algorithm. 
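
A subcube broadcast can be sketched as a doubling of the set of vertices that hold the message. The function name and the representation (addresses packed into integers, dimensions numbered from 0 at the least significant bit) are assumptions made for illustration.

```python
def broadcast(source, dims):
    """Vertices reached when the message crosses each dimension in dims in turn.

    One parallel step is used per dimension, so a k-dimensional subcube
    is covered in k steps.
    """
    holders = {source}
    for j in dims:
        # Every current holder forwards the message to its pair across dimension j.
        holders |= {v ^ (1 << j) for v in holders}
    return holders


if __name__ == "__main__":
    # The example from the text: start at (0, 1, 0) and use the second and third dimensions.
    print(sorted(broadcast(0b010, [1, 2])))
    # A full broadcast from vertex 0 in descending order of dimension.
    print(sorted(broadcast(0, [2, 1, 0])))
```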



7.7.3 Shifting on the Hypercube 

Cyclic shifting can also be done on a hypercube as a normal algorithm. For $n = 2^d$, consider 
shifting the n-tuple $x = (x_{n-1}, \ldots, x_0)$ cyclically left by k places on a d-dimensional 
hypercube. If $k \le n/2$ (see Fig. 7.16(a)), the largest element in the right half of x, namely $x_{n/2-1}$, 
moves to the left half of x. On the other hand, if $k > n/2$ (see Fig. 7.16(b)), $x_{n/2-1}$ moves 
to the right half of x. 

Thus, to shift x left cyclically by k places, $k \le n/2$, divide x into two (n/2)-tuples, 
shift each of these tuples cyclically by k places, and then swap the rightmost k components 
of the two halves, as suggested in Fig. 7.16(a). The swap is done via edges across the highest 
dimension of the hypercube. When $k > n/2$, cyclically shift each (n/2)-tuple by $k - n/2$ 
positions and then swap the high-order $n - k$ positions from each tuple across the highest 
dimension of the hypercube. We have the following result. 

Figure 7.16 The two cases of a normal algorithm for cyclic shifting on a hypercube. 

THEOREM 7.7.3 Cyclic shifting of an n-tuple, $n = 2^d$, by any amount can be done recursively by 
a normal algorithm in $\log_2 n$ communication steps. 
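
The recursion can be sketched sequentially as follows. The function name is hypothetical, and the list is indexed so that `x[i]` holds $x_i$; a left cyclic shift by k therefore moves the item at index i to index (i + k) mod n. The swap at each level of the recursion corresponds to one communication step across one hypercube dimension.

```python
def cyclic_shift(x, k):
    """Shift x cyclically left by k places using the two-case recursive scheme."""
    n = len(x)                      # assumed to be a power of two
    if n == 1:
        return list(x)
    k %= n
    half = n // 2
    if k <= half:
        inner = k                   # shift each half by k ...
        swap = range(0, k)          # ... then swap the rightmost k positions
    else:
        inner = k - half            # shift each half by k - n/2 ...
        swap = range(inner, half)   # ... then swap the high-order n - k positions
    lo = cyclic_shift(x[:half], inner)
    hi = cyclic_shift(x[half:], inner)
    for j in swap:                  # exchanges across the highest dimension
        lo[j], hi[j] = hi[j], lo[j]
    return lo + hi


if __name__ == "__main__":
    print(cyclic_shift(list(range(8)), 3))
```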

7.7.4 Shuffle and Unshuffle Permutations on Linear Arrays 

Because many important algorithms are normal and hypercubes are expensive to realize, it 
is preferable to realize normal algorithms on arrays. In this section we introduce the shuffle 
and unshuffle permutations, show that they can be used to realize normal algorithms, and then 
show that they can be realized on linear arrays. We use the unshuffle algorithms to map normal 
hypercube algorithms onto one- and two-dimensional meshes. 

Let $\mathbb{N}(n) = \{0, 1, 2, \ldots, n - 1\}$ and $n = 2^d$. The shuffle permutation 
$\pi^{(n)}_{\mathrm{shuffle}} : \mathbb{N}(n) \mapsto \mathbb{N}(n)$ moves the item in position a to position $\pi^{(n)}_{\mathrm{shuffle}}(a)$, where $\pi^{(n)}_{\mathrm{shuffle}}(a)$ is the 
integer represented by the left cyclic shift of the d-bit binary number representing a. For 
example, when n = 8 the integer 3 is represented by the binary number 011 and its left cyclic shift 
is 110. Thus, $\pi^{(8)}_{\mathrm{shuffle}}(3) = 6$. The shuffle permutation of the sequence {0, 1, 2, 3, 4, 5, 6, 7} 
is the sequence {0, 4, 1, 5, 2, 6, 3, 7}. A shuffle operation is analogous to interleaving of the 
two halves of a sorted deck of cards. Figure 7.17 shows this mapping for n = 8. 

The unshuffle permutation $\pi^{(n)}_{\mathrm{unshuffle}} : \mathbb{N}(n) \mapsto \mathbb{N}(n)$ reverses the shuffle operation: it 
moves the item in position b to position a where $b = \pi^{(n)}_{\mathrm{shuffle}}(a)$; that is, $a = \pi^{(n)}_{\mathrm{unshuffle}}(b) = 
\pi^{(n)}_{\mathrm{unshuffle}}(\pi^{(n)}_{\mathrm{shuffle}}(a))$. Figure 7.18 shows this mapping for n = 8. The shuffle permutation 
is obtained by reversing the directions of edges in this graph. 

An unshuffle operation can be performed on an n-cell linear array, $n = 2^d$, by assuming 
that the cells contain the integers {0, 1, 2, . . . , n − 1} from left to right represented as 
d-bit binary integers and then sorting them by their least significant bit using a stable sorting 
algorithm. (A stable sorting algorithm is one that does not change the original order of keys 
with the same value.) When this is done, the sequence {0, 1, 2, 3, 4, 5, 6, 7} is mapped to the 
sequence {0, 2, 4, 6, 1, 3, 5, 7}, the unshuffled sequence, as shown in Fig. 7.18. The integer 
b is mapped to the integer a whose binary representation is that of b shifted cyclically right 
by one position. For example, position 1 (001) is mapped to position 4 (100) and position 6 
(110) is mapped to position 3 (011). 

Figure 7.17 The shuffle permutation can be realized by a series of swaps of the contents of cells. 
The cells between which swaps are done have a heavy bar above them. The result of swapping cells 
of one row is shown in the next higher row, so that the top row contains the result of shuffling the 
bottom row. 
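
Both permutations reduce to one-bit rotations of the position index, which the following sketch makes explicit. The helper names are hypothetical. The last line of the demonstration uses Python's built-in `sorted`, which is stable, to confirm that sorting by the least significant bit yields the unshuffled sequence.

```python
def shuffle_pos(a, d):
    """Left cyclic shift of the d-bit number a."""
    mask = (1 << d) - 1
    return ((a << 1) & mask) | (a >> (d - 1))


def unshuffle_pos(b, d):
    """Right cyclic shift of the d-bit number b."""
    return (b >> 1) | ((b & 1) << (d - 1))


def permute(position_map, items, d):
    """Move the item in each position p to position position_map(p, d)."""
    out = [None] * len(items)
    for pos, item in enumerate(items):
        out[position_map(pos, d)] = item
    return out


if __name__ == "__main__":
    seq = list(range(8))
    print(permute(shuffle_pos, seq, 3))
    print(permute(unshuffle_pos, seq, 3))
    print(sorted(seq, key=lambda v: v & 1))   # stable sort on the least significant bit
```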

Since bubble sort is a stable sorting algorithm, we use it to realize the unshuffle permuta- 
tion. (See Section 7.5.2.) In each phase keys (binary tuples) are compared based on their least 
significant bits. In the first phase values in positions i and i + 1 are compared for i even. The 
next comparison is between such pairs for i odd. Comparisons of this form continue, alternat- 
ing between even and odd values for i, until the sequence is sorted. Since the first phase has 
no effect on the integers {0,1,2, ... ,n — 1}, it is not done. Subsequent phases are shown in 
Fig. 7.18. Pairs that are compared are connected by a light line; a darker line joins pairs whose 
values are swapped. (See Problem 7.16.) 

We now show how to implement efficiently a fully normal ascending algorithm on a linear 
array. (See Fig. 7.19.) Let the exchange locations of the linear array be locations i and i + 1 
of the array for i even. Only elements in exchange locations are swapped. Swapping across 
the first dimension of the hypercube is done by swaps across exchange locations. To simulate 
exchanges across the second dimension, perform a shuffle operation (by reversing the order of 
the operations of Fig. 7.18) on each group of four elements. This places into exchange locations 
elements whose original indices differed by two. Performing a shuffle on eight, sixteen, etc. 
positions places into exchange locations elements whose original indices differed by four, eight, 
etc. The proof of correctness of this result is left to the reader. (See Problem 7.17.) 

Figure 7.18 An unshuffle operation is obtained by bubble sorting the integers {0, 1, 2, . . . , n − 1} 
based on the value of their least significant bits. The cells with bars over them are compared. 
The first set of comparisons is done on elements in the bottom row. Those pairs with light bars 
contain integers whose values are in order. 

Figure 7.19 A normal ascending algorithm realized by shuffle operations on $2^k$ elements, 
k = 2, 3, 4, . . ., places into exchange locations elements whose indices differ by increasing powers 
of two. Exchange locations are paired together. 

Since a shuffle on $n = 2^d$ elements can be done in $2^{d-1} - 1$ steps on a linear array 
with n cells (see Theorem 7.5.2), it follows that this fully normal ascending algorithm uses 
$T(n) = \phi(d)$ steps, where $T(2) = \phi(1) = 0$ and 

$$\phi(d) = \phi(d-1) + 2^{d-1} - 1 = 2^d - d - 1.$$

Do a fully normal descending algorithm by a shuffle followed by its steps in reverse order. 

THEOREM 7.7.4 A fully normal ascending (descending) algorithm that runs in $d = \log_2 n$ steps 
on a d-dimensional hypercube containing $2^d$ vertices can be realized on a linear array of $n = 2^d$ 
elements with $T(n) = n - \log_2 n - 1$ ($2T(n)$) additional parallel steps. 
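
The shuffle schedule and its cost can be traced with a short sketch. The function names are hypothetical. Each shuffle of a block is modeled as an interleaving of its two halves, and each shuffle on $2^k$ elements is charged $2^{k-1} - 1$ steps, the figure quoted above; the sketch records, stage by stage, the index difference of the elements sitting in exchange locations.

```python
def interleave(block):
    """Shuffle a block by interleaving its two halves."""
    half = len(block) // 2
    out = []
    for lo, hi in zip(block[:half], block[half:]):
        out += [lo, hi]
    return out


def ascending_schedule(d):
    """Index gaps seen in exchange locations at each stage, and total shuffle steps."""
    cells = list(range(2 ** d))
    gaps, steps = [], 0
    for k in range(1, d + 1):
        if k > 1:
            size = 2 ** k
            cells = [v for s in range(0, len(cells), size)
                     for v in interleave(cells[s:s + size])]
            steps += 2 ** (k - 1) - 1          # assumed cost of a shuffle on 2^k cells
        # Exchange locations are positions (i, i + 1) for even i.
        gaps.append({abs(cells[i + 1] - cells[i]) for i in range(0, len(cells), 2)})
    return gaps, steps


if __name__ == "__main__":
    print(ascending_schedule(4))
```

For d = 4 the stages pair elements whose indices differ by 1, 2, 4, and 8, and the shuffles cost 1 + 3 + 7 = 11 steps, which equals $n - \log_2 n - 1$ for n = 16.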

From the discussion of Section 7.7 it follows that broadcasting, associative combining, 
and the FFT algorithm can be executed on a linear array in O(n) steps because each can be 
implemented as a normal algorithm on the n-vertex hypercube. Also, a list of n items can 
be sorted on a linear array in O(n) steps by translating Batcher's sorting algorithm based on 
bitonic merging, a normal sorting algorithm, to the linear array. (See Problem 7.20.) 

7.7.5 Fully Normal Algorithms on Two-Dimensional Arrays 

We now consider the execution of a normal algorithm on a rectangular array. We assume 
that the $n = 2^{2d}$ vertices of a 2d-dimensional hypercube are mapped onto an m x m mesh, 
$m = 2^d$, in row-major order. Since each cell is indexed by a pair consisting of row and column 
indices, (r, c), and each of these satisfies $0 \le r \le m - 1$ and $0 \le c \le m - 1$, they can each be 
represented by a d-bit binary number. Let r and c be these binary numbers. Thus cell (r, c) 
is indexed by the 2d-bit binary number rc. 

Cells in positions (r, c) and (r, c + 1) have associated binary numbers that agree in their 
d most significant positions. Cells in positions (r, c) and (r + 1, c) have associated binary 
numbers that agree in their d least significant positions. To simulate a normal hypercube 
algorithm on the 2D mesh, in each row simulate a normal hypercube algorithm on $2^d$ vertices after 
which in each column simulate a normal hypercube algorithm on $2^d$ vertices. The correctness 
of this procedure follows because every adjacent pair of vertices of the simulated hypercube is 
at some time located in adjacent cells of the 2D array. 

From Theorem 7.7.4 it follows that hypercube exchanges across the lower d dimensions 
can be simulated in time proportional to the length of a row, that is, in time $O(\sqrt{n})$. Similarly, 
it also follows that hypercube exchanges across the higher d dimensions can be simulated in 
time $O(\sqrt{n})$. We summarize this result below. 

THEOREM 7.7.5 A fully normal 2d-dimensional hypercube algorithm (ascending or descending), 
$n = 2^{2d}$, can be realized in $O(\sqrt{n})$ steps on a $\sqrt{n} \times \sqrt{n}$ array. 



It follows from the discussion of Section 7.7 that broadcasting, associative combining, 
and the FFT algorithm can be executed on a 2D mesh in $O(\sqrt{n})$ steps because each can be 
implemented as a normal algorithm on the n-vertex hypercube. 

Also, a list of n items can be sorted on a $\sqrt{n} \times \sqrt{n}$ array in $O(\sqrt{n})$ steps by translating 
a normal merging algorithm to the $\sqrt{n} \times \sqrt{n}$ array and using it recursively to create a sorting 
network. (See Problem 7.21.) No sorting algorithm can sort in fewer than $2\sqrt{m} - 2$ steps on 
a $\sqrt{m} \times \sqrt{m}$ array because whatever element is positioned in the lower right-hand corner of 
the array could originate in the upper left-hand corner and have to traverse at least $2\sqrt{m} - 2$ 
edges to arrive there. 

7.7.6 Normal Algorithms on Cube-Connected Cycles 

Consider now processors connected as a d-dimensional cube-connected cycle (CCC) network 
in which each ring has $r = 2^k \ge d$ processors. In particular, let r be the smallest power of 2 
greater than or equal to d, so that $d \le r < 2d$. (Thus $k = O(\log d)$.) We call such a CCC 
network a canonical CCC network on n vertices. It has $n = r2^d$ vertices, $d2^d \le n < (2d)2^d$. 
(Thus $d = O(\log n)$.) We show that a fully normal algorithm can be executed efficiently on 
such CCC networks. 

Let each ring of the CCC network be indexed by a d-tuple corresponding to the corner 
of the hypercube at which it resides. Let each processor be indexed by a (d + k)-tuple in 
which the d low-order bits are the ring index and the k high-order bits specify the position of 
a processor on the ring. 

A fully normal algorithm on a canonical CCC network is implemented in two phases. In 
the first phase, the ring is treated as an array and a fully normal algorithm on the k high-order 
bits is simulated in O(d) steps. In the second phase, exchanges are made across hypercube 
edges. Rotate the elements on each ring so that ring processors whose k-bit indices are 0 (call 
these the lead elements) are adjacent along the first dimension of the original hypercube. 
Exchange information between them. Now rotate the rings by one position so that lead elements 
are adjacent along the second dimension of the original hypercube. The elements immediately 
behind the lead elements on the rings are now adjacent along the first hypercube dimension 
and are exchanged in parallel with the lead elements. (This simultaneous execution is called 
pipelining.) Subsequent rotations of the rings place successive ring elements in alignment 
along increasing bit positions. After O(d) rotations all exchanges are complete. Thus, a total 
of O(d) time steps suffice to execute a fully normal algorithm. We have the following result. 

THEOREM 7.7.6 A fully normal algorithm (ascending or descending) for an n-vertex hypercube 
can be realized in $O(\log n)$ steps on a canonical n-vertex cube-connected cycle network. 

Thus, a fully normal algorithm on an n-vertex hypercube can be simulated on a CCC 
network in time proportional to the time on the hypercube. However, the vertices of the CCC 
have bounded degree, which makes them much easier to realize in hardware than high-degree 
networks. 

7.7.7 Fast Matrix Multiplication on the Hypercube 

Matrix multiplication can be done more quickly on the hypercube than on a two-dimensional 
array. Instead of $O(n)$ steps, only $O(\log n)$ steps are needed, as we show. 

Consider the multiplication of n x n matrices A and B for $n = 2^r$ to produce the product 
matrix $C = A \times B$. We describe a normal systolic algorithm to multiply these matrices on a 
d-dimensional hypercube, d = 3r. 

Since d = 3r, the vertices of the d-dimensional hypercube are addressed by a binary 
3r-tuple, $a = (a_{3r-1}, a_{3r-2}, \ldots, a_0)$. Let the r least significant bits of a denote an integer i, let 
the next r lsb's denote an integer j, and let the r most significant bits denote an integer k. 
Then, we have $|a| = kn^2 + jn + i$ since $n = 2^r$. Because of this identity, we represent the 
address a by the triple (i, j, k). We speak of the processor $P_{i,j,k}$ located at the vertex (i, j, k) 
of the d-dimensional hypercube, d = 3r. We denote by $HC_{i,j,\cdot}$ the subhypercube in which i 
and j are fixed and by $HC_{i,\cdot,k}$ and $HC_{\cdot,j,k}$ the subhypercubes in which the two other pairs 
of indices are fixed. There are $2^{2r}$ subhypercubes of each kind. 

We assume that each processor $P_{i,j,k}$ contains three local variables, $A_{i,j,k}$, $B_{i,j,k}$, and
$C_{i,j,k}$. We also assume that initially $A_{i,j,0} = a_{i,j}$ and $B_{i,j,0} = b_{i,j}$, where
$0 \leq i, j \leq n - 1$. The multiplication algorithm has the following five phases:

a) For each subhypercube $HC_{i,j,-}$ and for $1 \leq k \leq n - 1$, broadcast $A_{i,j,0}$ (containing
$a_{i,j}$) to $A_{i,j,k}$ and $B_{i,j,0}$ (containing $b_{i,j}$) to $B_{i,j,k}$.

b) For each subhypercube $HC_{i,-,k}$ and for $0 \leq j \leq n - 1$, $j \neq k$, broadcast $A_{i,k,k}$
(containing $a_{i,k}$) to $A_{i,j,k}$.

c) For each subhypercube $HC_{-,j,k}$ and for $0 \leq i \leq n - 1$, $i \neq k$, broadcast $B_{k,j,k}$
(containing $b_{k,j}$) to $B_{i,j,k}$.

d) At each processor $P_{i,j,k}$ compute $C_{i,j,k} = A_{i,j,k} \cdot B_{i,j,k} = a_{i,k} b_{k,j}$.

e) At processor $P_{i,j,0}$ compute the sum $C_{i,j,0} = \sum_k C_{i,j,k}$ ($C_{i,j,0}$ now contains
$c_{i,j} = \sum_k a_{i,k} b_{k,j}$).
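The five phases can be checked with a small sequential simulation. The following is a minimal sketch, not a parallel implementation: the function name `hypercube_matmul` and the nested-list layout are assumptions made for illustration, and each broadcast or sum that would take $O(r)$ parallel steps on the hypercube is written here as an ordinary loop.

```python
def hypercube_matmul(a, b):
    """Simulate phases a) to e) on an n x n x n grid of processors P[i][j][k]."""
    n = len(a)
    # Local variables A, B, C held by every processor, indexed [i][j][k].
    A = [[[None] * n for _ in range(n)] for _ in range(n)]
    B = [[[None] * n for _ in range(n)] for _ in range(n)]
    C = [[[None] * n for _ in range(n)] for _ in range(n)]
    # Initial state: the face k = 0 holds the two input matrices.
    for i in range(n):
        for j in range(n):
            A[i][j][0] = a[i][j]
            B[i][j][0] = b[i][j]
    # Phase a: inside each HC_{i,j,-}, broadcast the k = 0 values along k.
    for i in range(n):
        for j in range(n):
            for k in range(1, n):
                A[i][j][k] = A[i][j][0]
                B[i][j][k] = B[i][j][0]
    # Phase b: inside each HC_{i,-,k}, broadcast A[i][k][k] (= a[i][k]) along j.
    for i in range(n):
        for k in range(n):
            src = A[i][k][k]
            for j in range(n):
                A[i][j][k] = src
    # Phase c: inside each HC_{-,j,k}, broadcast B[k][j][k] (= b[k][j]) along i.
    for j in range(n):
        for k in range(n):
            src = B[k][j][k]
            for i in range(n):
                B[i][j][k] = src
    # Phase d: every processor forms one product a[i][k] * b[k][j].
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j][k] = A[i][j][k] * B[i][j][k]
    # Phase e: sum over k; the result is what processor (i, j, 0) would hold.
    return [[sum(C[i][j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]


product = hypercube_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

Here `product` is the ordinary matrix product of the two inputs, which confirms that the broadcasts place $a_{i,k}$ and $b_{k,j}$ at processor $(i, j, k)$ before phase d.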

From Theorem 7.7.2 it follows that each of these five phases can be done in $O(r)$ steps,
where $r = \log_2 n$. We summarize this result below.

THEOREM 7.7.7 Two $n \times n$ matrices, $n = 2^r$, can be multiplied by a normal systolic algorithm on
a d-dimensional hypercube, $d = 3r$, with $n^3$ processors in $O(\log n)$ steps. All normal algorithms
for $n \times n$ matrix multiplication require $\Omega(\log n)$ steps.

Proof The upper bound follows from the construction. The lower bound follows from the
observation that each processor that is participating in the execution of a normal algorithm
combines two values, one that it owns and one owned by one of its neighbors. Thus, if
t steps are executed to compute a value, that value cannot depend on more than $2^t$ other
values. Since each entry in an $n \times n$ product matrix is a function of $2n$ other values, t must
be at least $\log_2(2n)$. ■

The lower bound stated above applies only to normal algorithms. If a non-normal algorithm
is used, each processor can combine up to d values. Thus, after k steps, up to $d^k$ values
can be combined. If $2n$ values must be combined, as in $n \times n$ matrix multiplication, then
$k \geq \log_d(2n) = (\log_2 2n)/\log_2 d$. If an $n^3$-processor hypercube is used for this problem,
$d = 3\log_2 n$ and $k = \Omega(\log n/\log\log n)$.
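To make the two lower bounds concrete, the following sketch evaluates them for one illustrative size. The helper name `min_steps` and the choice $n = 1024$ are assumptions for the example; the helper simply finds the smallest number of steps t for which a fan-in of 2 (normal) or d (non-normal) per step can reach $2n$ values.

```python
def min_steps(num_values, fan_in):
    """Smallest t with fan_in**t >= num_values, using integer arithmetic only."""
    t, reach = 0, 1
    while reach < num_values:
        reach *= fan_in   # one more step multiplies the reachable values by fan_in
        t += 1
    return t


n = 1024                 # illustrative matrix dimension, n = 2**10
d = 3 * 10               # hypercube dimension d = 3 * log2(n)
normal_bound = min_steps(2 * n, 2)       # normal: each step combines 2 values
non_normal_bound = min_steps(2 * n, d)   # non-normal: up to d values per step
```

For this size the normal bound is 11 steps while the non-normal bound is only 3, which illustrates the $\log n$ versus $\log n/\log\log n$ gap.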

The normal matrix multiplication algorithm described above can be translated to linear
arrays and 2D meshes using the mappings based on the shuffle and unshuffle operations. The
2D mesh version has a running time $O(\sqrt{n}\log n)$, which is inferior to the running time of
the algorithm given in Section 7.5.3.



7.8 Routing in Networks 



A topic of major concern in the design of distributed memory machines is routing, the task of 
transmitting messages among processors via nodes of a network. Routing becomes challenging 
when many messages must travel simultaneously through a network because they can produce 
congestion at nodes and cause delays in the receipt of messages. 

Some routing networks are designed primarily for the permutation-routing problem, the 
problem of establishing a one-to-one correspondence between n senders and n receivers. (A 
processor can be both a sender and receiver.) Each sender sends one message to a unique 
receiver and each receiver receives one message from a unique sender. (We examine in Sec- 
tion 7.9.3 routing methods when the numbers of senders and receivers differ and more than 
one message can be received by one processor.) If many messages are targeted at one receiver, 
a long delay will be experienced at this receiver. It should be noted that network congestion 
can occur at a node even when messages are uniformly distributed throughout the network, 
because many messages may have to pass through this node to reach their destinations. 

7.8.1 Local Routing Networks

In a local routing network each message is accompanied by its destination address. At each 
network node (switch) the routing algorithm, using only these addresses and not knowing the 
global state of the network, finds a path for messages. 

A sorting network, suitably modified to transmit messages, is a local permutation-routing
network. Batcher's bitonic sorting network described in Section 6.8.1 will serve as such a
network. As mentioned in Section 7.7, this network can be realized as a normal algorithm on
a hypercube, with running time $O(\log^2 n)$ on an n-vertex hypercube. (See Problem 6.28.)
On the two-dimensional mesh its running time is $O(\sqrt{n})$ (see Problem 7.21), whereas on the
linear array it is $O(n)$ (see Problem 7.20).
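As an illustration of routing by sorting, the sketch below sorts (destination, payload) pairs with the standard iterative form of the bitonic sorting network. The function name `bitonic_route` and the tuple representation of messages are assumptions for this example; the input length is assumed to be a power of two and the destinations a permutation of the positions.

```python
def bitonic_route(messages):
    """Sort (destination, payload) pairs with a bitonic compare-exchange network.

    Returns the routed list and the number of compare-exchange rounds used.
    """
    a = list(messages)
    n = len(a)                      # assumed to be a power of two
    rounds = 0
    k = 2
    while k <= n:                   # merge bitonic sequences of length k
        j = k // 2
        while j >= 1:               # partners differ in exactly one address bit
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    # The comparison pattern is fixed in advance (data-oblivious);
                    # only the outcome of each exchange depends on the data.
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            rounds += 1
            j //= 2
        k *= 2
    return a, rounds


routed, rounds = bitonic_route([(2, "a"), (0, "b"), (3, "c"), (1, "d")])
```

After the call, the message destined for position t sits at index t of `routed`. The round count grows as $\log_2 n(\log_2 n + 1)/2$, since each round pairs elements whose positions differ in a single address bit.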

Batcher's bitonic sorting network is data-oblivious; that is, it performs the same set of
operations for all values of the input data. The outcomes of these operations are data-dependent,
but the operations themselves are data-independent. Non-oblivious sorting algorithms perform
operations that depend on the values of the input data. An example of a local non-oblivious
algorithm is one that sends a message from the current network node to the neighboring
node that is closest to the destination.
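A minimal sketch of such a local rule, assuming the network is a d-dimensional hypercube with integer addresses: at each node the message moves to a neighbor that agrees with the destination in one more address bit. The function name `greedy_route` and the tie-breaking rule (lowest differing dimension first) are choices made for this example, since any neighbor across a differing bit is equally close.

```python
def greedy_route(src, dst, d):
    """Return the list of nodes visited from src to dst on a d-dimensional hypercube."""
    path = [src]
    cur = src
    for dim in range(d):
        # Only the message's destination and the current address are consulted.
        if ((cur ^ dst) >> dim) & 1:
            cur ^= 1 << dim          # cross dimension dim to a strictly closer neighbor
            path.append(cur)
    return path


path = greedy_route(0b000, 0b101, 3)
```

The path length equals the number of bits in which the source and destination differ, so no message travels more than d hops, although several messages may still meet at the same node.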

7.8.2 Global Routing Networks 

In a global routing network, knowledge of the destinations of all messages is used to set the 
network switches and select paths for the messages to follow. A global permutation-routing 
network realizes permutations of the destination addresses. We now give an example of such a 
network, the Benes permutation network. 

A permutation network is constructed of two-input, two-output switches. Such a switch 
either passes its inputs, labeled A and B, to its outputs, labeled X and Y, or it swaps them. That 
is, the switch is set so that either X = A and Y = B or X = B and Y = A. A permutation 
network on n inputs and n outputs is a directed acyclic graph of these switches such that for 
each permutation of the n inputs, switches can be set to create n disjoint paths from the n 
inputs to the n outputs. 

A Benes permutation network is shown in Fig. 7.20. This graph is produced by connecting
two copies of an FFT graph on $2^{k-1}$ inputs back to back and replacing the nodes
by switches and edges by pairs of edges. (FFT graphs are described in Section 6.7.3.) It follows
that a Benes permutation network on n inputs can be realized by a normal algorithm
that executes $O(\log n)$ steps. Thus, a permutation is computed much more quickly (in time
$O(\log n)$) with the Benes offline permutation network than it can be done on Batcher's online
bitonic sorting network (in time $O(\log^2 n)$). However, the Benes network requires time to
collect the destinations at some central location, compute the switch settings, and transmit
them to the switches themselves.

Figure 7.20 A Benes permutation network.

To understand how the Benes network works, we provide an alternative characterization
of it. Let $P_n$ be the Benes network on n inputs, $n = 2^k$, defined as back-to-back FFT graphs
with nodes replaced by switches. Then $P_n$ may be defined recursively, as suggested in Fig. 7.20.
$P_n$ is obtained by making two copies of $P_{n/2}$ and placing $n/2$ copies of a two-input, two-output
switch at the input and the same number at the output. For $1 \leq i \leq n/4$ ($n/4 + 1 \leq i \leq n/2$)
the top output of switch i is connected to the top input of the ith switch in the upper (lower)
copy of $P_{n/2}$ and the bottom output is connected to the bottom input of the ith switch in the
lower (upper) copy of $P_{n/2}$. The connections of output switches are the mirror image of the
connections of the input switches.

Consider the Benes network $P_2$. It consists of a single switch and generates the two possible
permutations of the inputs. We show by induction that $P_n$ generates all $n!$ permutations of its
n inputs. Assume that this property holds for $n = 2, 4, \ldots, 2^{k-1}$. We show that it holds for
$m = 2^k$. Let $\pi = (\pi(1), \pi(2), \ldots, \pi(m))$ be an arbitrary permutation to be realized by $P_m$.
This means that the ith input must be connected to the $\pi(i)$th output. Suppose that $\pi(3)$ is
2, as shown in Fig. 7.20. We can arbitrarily choose to have the third input pass through the
first or second copy of $P_{m/2}$. We choose the second. The path taken through the second copy
of $P_{m/2}$ must emerge on its second output so that it can then pass to the first switch in the
column of output switches. This output switch must pass its inputs without swapping them.
The other output of this switch, namely 1, must arrive via a path through the first copy of
$P_{m/2}$ and emerge on its first output. To determine the input at which it must arrive, we find
the input of $P_m$ associated with the output of 1 and set its switch so that it is directed to the
first copy of $P_{m/2}$. Since the other input to this input switch must go to the other copy of
$P_{m/2}$, we follow its path through $P_m$ to the output and then reason in the same way about the
other output at the output switch at which it arrives. If by tracing paths back and forth this
way we do not exhaust all inputs and outputs, we pick another input and repeat the process
until all inputs have been routed to outputs.
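The back-and-forth tracing argument can be turned into a short recursive procedure. The sketch below is an illustration under a simplifying assumption: input switch m takes inputs 2m and 2m+1 and feeds input m of the upper and lower subnetworks, and the output switches mirror this. That wiring is chosen for brevity and is not the exact wiring of Fig. 7.20; the function names and the dictionary layout of the switch settings are likewise hypothetical, and inputs are numbered from 0.

```python
from itertools import permutations


def benes_route(perm):
    """Compute switch settings so that input i reaches output perm[i]."""
    n = len(perm)
    if n == 2:
        return {"swap": perm[0] == 1}          # P_2 is a single switch
    inv = [0] * n
    for i, o in enumerate(perm):
        inv[o] = i
    side = [None] * n                          # 0 = upper copy, 1 = lower copy
    for start in range(n):
        i = start
        while side[i] is None:
            side[i] = 0                        # arbitrary choice for this input
            j = inv[perm[i] ^ 1]               # input whose output shares i's output switch
            side[j] = 1                        # so it must use the other copy
            i = j ^ 1                          # j's switch partner uses the first copy
    sub = [[0] * (n // 2), [0] * (n // 2)]
    for i in range(n):
        sub[side[i]][i // 2] = perm[i] // 2    # permutation each copy must realize
    return {
        "in": [side[2 * m] == 1 for m in range(n // 2)],
        "out": [side[inv[2 * t]] == 1 for t in range(n // 2)],
        "upper": benes_route(sub[0]),
        "lower": benes_route(sub[1]),
    }


def benes_apply(settings, n):
    """Trace every input through the configured switches; return the permutation."""
    if n == 2:
        return [1, 0] if settings["swap"] else [0, 1]
    upper = benes_apply(settings["upper"], n // 2)
    lower = benes_apply(settings["lower"], n // 2)
    out = [None] * n
    for i in range(n):
        m, pos = divmod(i, 2)
        goes_lower = (pos == 1) != settings["in"][m]
        t = (lower if goes_lower else upper)[m]
        out[i] = 2 * t + (int(goes_lower) ^ int(settings["out"][t]))
    return out


def realizes_all(n):
    """Check that every permutation of n inputs is realized (small n only)."""
    return all(benes_apply(benes_route(list(p)), n) == list(p)
               for p in permutations(range(n)))
```

Calling `realizes_all(4)` exhaustively checks all 24 permutations of four inputs, a direct test of the inductive claim for $P_4$ under the assumed wiring.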

Now let's determine the number of switches, $S(k)$, in a Benes network $P_n$ on $n = 2^k$
inputs. It follows that $S(1) = 1$ and

$S(k) = 2S(k-1) + 2^k$

It is straightforward to show that $S(k) = (k - \frac{1}{2})2^k = n(\log_2 n - \frac{1}{2})$.
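The closed form can be checked numerically against the recurrence. This is a small sketch; the function name `benes_switches` is an assumption, and the comparison is done in integers by doubling both sides to avoid the fraction one half.

```python
def benes_switches(k):
    """Number of switches S(k) in a Benes network on 2**k inputs, by the recurrence."""
    if k == 1:
        return 1
    return 2 * benes_switches(k - 1) + 2 ** k


# 2 * S(k) should equal (2k - 1) * 2**k, i.e. S(k) = (k - 1/2) * 2**k.
closed_form_holds = all(
    2 * benes_switches(k) == (2 * k - 1) * 2 ** k for k in range(1, 16)
)
```

For example, `benes_switches(2)` is 6 and `benes_switches(3)` is 20.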

Although a global permutation network sends messages to their destinations more quickly 
than a local permutation network, the switch settings must be computed and distributed glob- 
ally, both of which impose important limitations on the time to realize particular permutations. 

7.9 The PRAM Model

The parallel random-access machine (PRAM) (see Fig. 7.21), the canonical structured parallel
machine, consists of a bounded set of processors and a common memory containing a
potentially unlimited number of words. Each processor is similar to the random-access machine
(RAM) described in Section 3.4 except that its CPU can access locations in both its local
random-access memory and the common memory. During each PRAM step, the RAMs execute
the following steps in synchrony: they (a) read from the common memory, (b) perform
a local computation, and (c) write to the common memory. Each RAM has its own program
and program counter as well as a unique identifying number $id_j$ that it can access to make
processor-dependent decisions. The PRAM is primarily an abstract programming model, not
a machine designed to be built (unlike mesh-based computers, for example).

Figure 7.21 The PRAM consists of synchronous RAMs accessing a common memory.

The power of the PRAM has been explored by considering a variety of assumptions about 
the length of local computations and the type of instruction allowed. In designing parallel 
algorithms it is generally assumed that each local computation consists of a small number of 
instructions. However, when this restriction is dropped and the PRAM is allowed an unlim- 
ited number of computations between successive accesses to the common memory (the ideal 
PRAM), the information transmitted between processors reflects the minimal amount of in- 
formation that must be exchanged to solve a problem on a parallel computer. 

Because the size of memory words is potentially unbounded, very large numbers can be 
generated very quickly on a PRAM if a RAM can multiply and divide integers and perform 
vector operations. This allows each RAM to emulate a parallel machine with an unbounded 
number of processors. Since the goal is to understand the power of parallelism, however, this 
form of hidden parallelism is usually disallowed, either by not permitting these instructions or 
by assuming that in t steps a PRAM generates numbers whose size is bounded by a polynomial 
in t. To simplify the discussion, we limit instructions in a RAM's repertoire to addition, 
subtraction, vector comparison operations, conditional branching, and shifts by fixed amounts. 
We also allow load and store instructions for moving words between registers, local memories, 
and the common memory. These instructions are sufficiently rich to compute all computable 
functions. 

As yet we have not specified the conditions under which access to the common memory
occurs in the first and third substeps of each PRAM step. If access by more than one RAM to the
same location is disallowed, access is exclusive. If this restriction does not apply, access is
concurrent. Four combinations of these classifications apply to reading and writing. The strongest
restriction is placed on the Exclusive Read/Exclusive Write (EREW) PRAM, with successively
weaker restrictions placed on the Concurrent Read/Exclusive Write (CREW) PRAM,
the Exclusive Read/Concurrent Write (ERCW) PRAM, and the Concurrent Read/Concurrent
Write (CRCW) PRAM. When concurrent writing is allowed, conflicts are resolved
in one of the following ways: a) the COMMON model requires that all RAMs writing to a
common location write the same value, b) the ARBITRARY model allows an arbitrary value
to be written, and c) the PRIORITY model writes into the common location the value being
written by the lowest numbered RAM.
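The three write-conflict rules can be compared with a small sketch. The function name `resolve_write` and the (processor number, value) request format are assumptions for illustration; for the ARBITRARY model the sketch picks the first listed request, which is only one of the permitted outcomes.

```python
def resolve_write(requests, model):
    """Value stored when several RAMs write to one common location in one step.

    requests: non-empty list of (processor_number, value) pairs for that location.
    """
    if model == "COMMON":
        values = {value for _, value in requests}
        if len(values) != 1:
            raise ValueError("COMMON model requires all writers to agree")
        return requests[0][1]
    if model == "ARBITRARY":
        # Any competing value may be stored; this sketch takes the first listed.
        return requests[0][1]
    if model == "PRIORITY":
        # The lowest numbered RAM wins.
        return min(requests, key=lambda request: request[0])[1]
    raise ValueError("unknown model: " + model)


stored = resolve_write([(3, "x"), (1, "y"), (2, "z")], "PRIORITY")
```

Because the value chosen under PRIORITY is always one of the competing values, and the single value written under COMMON is trivially one of them, each model's outcome is acceptable to the next weaker rule.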

Observe that any algorithm written for the COMMON CRCW PRAM runs without 
change on the ARBITRARY CRCW PRAM. Similarly, an ARBITRARY CRCW PRAM al- 
gorithm runs without change on the PRIORITY CRCW PRAM. Thus, the latter is the most 
powerful of the PRAM models. 

In performing a computation on a PRAM it is typically assumed that the input is written 
in the lowest numbered locations of the common memory. PRAM computations are charac- 
terized by p, the number of processors (RAMs) in use, and T (time), the number of PRAM 
steps taken. Both measures are usually stated as a function of the size of a problem instance, 
namely m, the number of input words, and n, their total length in bits. 

After showing that tree, array, and hypercube algorithms translate directly to a PRAM 
algorithm with no loss in efficiency, we explore the power of concurrency. This is followed by a 
brief discussion of the simulation of a PRAM on a hypercube and a circuit on a CREW PRAM. 
We close by referring the reader to connections established between PRAMs and circuits and 
to the discussion of serial space and parallel time in Chapter 8. 



7.9.1 Simulating Trees, Arrays, and Hypercubes on the PRAM 

We have shown that 1D arrays can be embedded into 2D meshes and that d-dimensional
meshes can be embedded into hypercubes while preserving the neighborhood structure of the 
first graph in the second. Also, we have demonstrated that any balanced tree algorithm can be 
simulated as a normal algorithm on a hypercube. As a consequence, in each case, an algorithm 
designed for the first network carries over to the second without any increase in the number of 
steps executed. We now show that normal hypercube algorithms are efficiently simulated on 
an EREW PRAM. 

With each d-dimensional hypercube processor, associate an EREW PRAM processor and 
a reserved location in the common memory. In a normal algorithm each hypercube processor 
communicates with its neighbor along a specified direction. To simulate this communication, 
each associated PRAM processor writes the data to be communicated into its reserved location. 
The processor for which the message is destined knows which hypercube neighbor is providing 
the data and reads the value stored in its associated memory location. 
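The following sketch simulates this scheme for one simple normal algorithm, an ascending all-to-all sum, chosen here only as an example. The function name `erew_hypercube_allsum` is hypothetical, the list `shared` stands in for the reserved common-memory locations, and the assertion checks that each location is read by exactly one processor in each step, as the EREW model requires.

```python
def erew_hypercube_allsum(values):
    """Sum n = 2**d values with an ascending normal algorithm run on an EREW PRAM."""
    n = len(values)
    d = n.bit_length() - 1                 # n is assumed to be a power of two
    local = list(values)                   # one local value per PRAM processor
    for dim in range(d):
        # Write substep: processor p writes only its own reserved location.
        shared = list(local)
        # Read substep: processor p reads the location of its neighbor across dim.
        sources = [p ^ (1 << dim) for p in range(n)]
        assert len(set(sources)) == n      # every location is read exactly once
        incoming = [shared[q] for q in sources]
        # Local computation: combine own value with the neighbor's value.
        local = [local[p] + incoming[p] for p in range(n)]
    return local


totals = erew_hypercube_allsum([1, 2, 3, 4])
```

Each hypercube step costs one PRAM step here, so the d steps of the normal algorithm take d PRAM steps.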

When a hypercube algorithm is not normal, as many as d — 1 neighbors can send messages 
to one processor. Since EREW PRAM processors can access only one cell per unit time, 
simulation of the hypercube can require a running time that is about d times that of the 
hypercube. 

THEOREM 7.9.1 Every T-step normal algorithm on the d-dimensional, n-vertex hypercube,
$n = 2^d$, can be simulated in $O(T)$ steps on an n-processor EREW PRAM. Every T-step hypercube
algorithm, normal or not, can be simulated in $O(Td)$ steps.

An immediate consequence of Theorems 7.7.1 and 7.9.1 is that a list of n items can be
sorted on an n-processor PRAM in $O(\log^2 n)$ steps by a normal oblivious algorithm.
Data-dependent sorting algorithms for the hypercube exist with running time $O(\log n)$.

It also follows from Section 7.6.1 that algorithms for trees, linear arrays, and meshes trans- 
late directly into PRAM algorithms with the same running time as on these less general models. 
Of course, the superior connectivity between PRAM processors might be used to produce faster 
algorithms. 

7.9.2 The Power of Concurrency 

The CRCW PRAM is a very powerful model. As we show, any Boolean function can be 
computed with it in a constant number of steps if a sufficient number of processors is available. 
For this reason, the CRCW PRAM is of limited interest: it represents an extreme that does 
not reflect reality as we know it. The CREW and EREW PRAMs are more realistic. We 
first explore the power of the CRCW and then show that an EREW PRAM can simulate a
p-processor CRCW PRAM with a slowdown by a factor of $O(\log^2 p)$.

THEOREM 7.9.2 The CRCW PRAM can compute an arbitrary Boolean function in four steps. 

Proof Given a Boolean function $f : B^n \mapsto B$, represent it by its disjunctive normal form;
that is, represent it as the OR of its minterms where a minterm is the AND of each literal of
$f$. (A literal is a variable, $x_i$, or its complement, $\bar{x}_i$.) Assume that each variable is stored in
a separate location in the common memory.

Given a minterm, we show that it can be computed by a CRCW PRAM in two steps.
Assign one location in the common memory to the minterm and initialize it to the value 1.
Assign one processor to each literal in the minterm. The processor assigned to the jth literal
reads the value of the jth variable from the common memory. If the value of the literal is 0,
this processor writes the value 0 to the memory location associated with the minterm. Thus,
the minterm has value 1 exactly when each literal has value 1. Note that these processors read
concurrently with processors associated with other minterms and may write concurrently if
more than one of their literals has value 0.

Now assume that a common memory location has been reserved for the function itself
and initialized to 0. One processor is assigned to each minterm and if the value of its
minterm is 1, it writes the value 1 in the location associated with the function. Thus, in two
more steps the function $f$ is computed. ■
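The construction in the proof can be traced with a short sequential sketch. The function name `crcw_boolean` and the encoding of a literal as a (variable index, required value) pair are assumptions for illustration; the loops stand in for processors that would all act in the same PRAM step, and every concurrent write to a given cell writes the same value.

```python
def crcw_boolean(minterms, x):
    """Evaluate a function given in disjunctive normal form as in the proof.

    minterms: list of minterms, each a tuple of (variable_index, required_value).
    x: tuple of input bits held in the common memory.
    """
    # One cell per minterm, initialized to 1.
    minterm_cell = [1] * len(minterms)
    # One processor per literal: read its variable, write 0 if the literal is 0.
    for t, term in enumerate(minterms):
        for var, required in term:
            if x[var] != required:
                minterm_cell[t] = 0        # all concurrent writers write 0
    # One cell for the function, initialized to 0.
    f_cell = 0
    # One processor per minterm: write 1 if its minterm has value 1.
    for t in range(len(minterms)):
        if minterm_cell[t] == 1:
            f_cell = 1                     # all concurrent writers write 1
    return f_cell


# Exclusive OR of two variables: minterms x0 AND (NOT x1), (NOT x0) AND x1.
xor_minterms = [((0, 1), (1, 0)), ((0, 0), (1, 1))]
xor_table = [crcw_boolean(xor_minterms, (a, b)) for a in (0, 1) for b in (0, 1)]
```

The number of processors, not the number of steps, grows with the function: one processor is used per literal occurrence and one per minterm.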

Given the power of concurrency, especially as applied to writing, we now explore the cost 
in performance of not allowing concurrency, whether in reading or writing. 

THEOREM 7.9.3 A p-processor priority CRCW PRAM can be simulated by a p-processor EREW
PRAM with a slowdown by a factor equal to the time to sort p elements on this machine.
Consequently, this simulation can be done by a normal algorithm with a slowdown factor of $O(\log^2 p)$.

Proof The jth EREW PRAM processor simulates a memory access by the jth CRCW
PRAM processor by first writing into a special location, $M_j$, a pair $(a_j, j)$ indicating that
processor j wishes to access (read or write) location $a_j$. If processors are writing to common
memory, the value to be written is attached to this pair. If processors are reading from
common memory, a return message containing the requested value is provided. If a processor
chooses not to access any location, a dummy address larger than all other addresses is used for
$a_j$. The contents of the locations $M_1, M_2, \ldots, M_p$ are sorted, which creates a sequence
in which pairs with a common address occur together and within which the pairs are sorted
by processor numbers. From Theorem 7.7.1 it follows that this step can be performed in
time $O(\log^2 p)$ by a normal algorithm. So far no concurrent reads or writes occur.

A processor is now assigned to each pair in the sorted sequence. We consider two cases: 
a) processors are reading from or b) writing to common memory. Each processor now 
compares the address of its pair to that of the preceding pair. If a processor finds these 
addresses to be different and case a holds, it reads the item in common memory and sets a 
flag bit to 1; all other processors except the first set their flag bits to 0; the first sets its bit to 1. 
(This bit is used later to distribute the value that was read.) However, if case b holds instead, 
the processor writes its value. Since this processor has the lowest index of all processors and 
the priority CRCW is the strongest model, the value written is the same value written by 
either the common or arbitrary CRCW models. 

Returning now to case a, the flag bits mark the first pair in each subsequence of pairs
that have the same address in the common memory. Associated with the leading pair is the
value read at this address. We now perform a segmented prefix computation using as the
associative rule the copy-right operation. (See Problem 2.20.) It distributes to each pair
$(a_j, j)$ the value the processor wished to read from the common memory. By Problem 2.21
this problem can be solved by a p-processor EREW PRAM in $O(\log p)$ steps. The pairs
and their accompanying value are then sorted by the processor number so that the value
read from the common memory is in a location reserved for the processor that requested the
value. ■
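The two cases of the proof can be followed in a sequential sketch. The function names and request formats below are assumptions for illustration; Python's built-in sort and a left-to-right loop stand in for the parallel sort and the segmented copy-right prefix computation, so the sketch shows which accesses occur rather than their parallel running time.

```python
def simulate_priority_write(memory, requests):
    """requests[j] is (address, value), or None if processor j does not write."""
    dummy = len(memory)                     # larger than every real address
    pairs = sorted((dummy, j, None) if r is None else (r[0], j, r[1])
                   for j, r in enumerate(requests))
    for pos, (addr, j, value) in enumerate(pairs):
        leader = pos == 0 or pairs[pos - 1][0] != addr
        # Only the first pair of each address group writes: it belongs to the
        # lowest numbered processor, and no location is written twice.
        if leader and addr != dummy:
            memory[addr] = value
    return memory


def simulate_concurrent_read(memory, addresses):
    """addresses[j] is the location processor j wishes to read."""
    pairs = sorted((a, j) for j, a in enumerate(addresses))
    flags = [pos == 0 or pairs[pos - 1][0] != a
             for pos, (a, j) in enumerate(pairs)]
    # Only leaders touch the common memory, so each location is read once.
    values = [memory[a] if flags[pos] else None
              for pos, (a, j) in enumerate(pairs)]
    # Segmented copy-right prefix: each non-leader copies from its left neighbor.
    for pos in range(1, len(pairs)):
        if not flags[pos]:
            values[pos] = values[pos - 1]
    # Return the values to the requesting processors (the final sort by j).
    result = [None] * len(addresses)
    for (a, j), v in zip(pairs, values):
        result[j] = v
    return result


written = simulate_priority_write([0, 0, 0], [(1, "a"), None, (1, "b"), (0, "c")])
read_back = simulate_concurrent_read([10, 20, 30], [2, 0, 2, 1, 0])
```

In the write example processors 0 and 2 both target location 1, and the value from processor 0 is the one stored.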

7.9.3 Simulating the PRAM on a Hypercube Network 

As stated above, each PRAM cycle involves reading from the global memory, performing a 
local computation, and writing to the common memory. Of course, a processor need not 
access common memory when given the chance. Thus, to simulate a PRAM on a network 
computer, one has to take into account the fact that not all PRAM processors necessarily read 
from or write to common memory locations on each cycle. 

It is important to remember that the latency of network computers can be large. Thus, for 
the simulation described below to be useful, each PRAM processor must be able to do a lot of 
work between network accesses. 

The EREW PRAM is simulated on a network computer by executing three phases, two of 
which correspond to reading and writing common memory. (To simulate the CRCW PRAM, 
we need only add the time given above to simulate a CRCW PRAM by an EREW PRAM.) 
We simulate an access to common memory by routing a message over the network to the site 
containing the simulated common memory location. It follows that a message must contain 
the name of a site as well as the address of a memory location at that site. If the simulated 
access is a memory read, a return message is generated containing the value of the memory 
location. If it is a memory write, the transmitted message must also contain the datum to write 
into the memory location. We assume that the sites are numbered consecutively from 1 to p, 
the number of processors. 

The first problem to be solved is the routing of messages from source to destination
processors. This routing problem was partially addressed in Section 7.8. The new wrinkle here is
that the mapping from source to destination sites defined by a set of messages is not necessarily
a permutation. Not all sources may send a message and not all destinations are guaranteed to
receive only one message. In fact, some destinations may be sent many messages, which can
result in their waiting a long time for receipt.

To develop an appreciation for the various approaches to this problem, we describe an
algorithm that distributes messages from sources to destinations, though not as efficiently as
possible. Each processor prepares a message to be sent to other processors. Processors not
accessing the common memory send messages containing dummy site addresses larger than any
other address. All messages are sorted by destination address cooperatively by the processors.
As seen in Theorem 7.7.1, they can be sorted by a normal algorithm on a p-vertex hypercube,
$p = 2^d$, in $O(\log^2 p)$ steps using Batcher's bitonic sorting network described in Section 6.8.1.
The $k \leq p$ non-dummy messages are the first k messages in this sorted list. If the sites at
which these messages reside after sorting are the sites for which they were destined, the message
routing problem is solved. Unfortunately, this is generally not the case.

To route the messages from their positions in the sorted list to their destinations, we first
identify duplicates of destination addresses and compute D, the maximum number of duplicates.
We then route messages in D stages. In each stage at most one of the D duplicates
of each message is routed to its destination. To identify duplicates, we assign a processor to
each message in the sorted list that compares its destination site with that of its predecessor,
setting a flag bit to 0 if equal and to 1 otherwise. To compare destinations, move messages
to adjacent vertices on the hypercube, compare, and then reverse the process. (Move them by
sorting by appropriate addresses.) The first processor also sets its flag bit to 1. A segmented
integer addition prefix operation that segments its messages with these flag bits assigns to each
message an integer (a priority) between 1 and D that is q if the site address of this message is
the qth such address. (Prefix computations can be done on a p-vertex hypercube in $O(\log p)$
steps. See Problem 7.23.) A message with priority q is routed to its destination in the qth stage.
An unsegmented prefix operation with max as the operator is then used to determine D.
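The flag bits, priorities, and D can be computed as in the sketch below. The function name `assign_priorities` is an assumption, and a sequential left-to-right loop replaces the $O(\log p)$-step parallel prefix computations.

```python
def assign_priorities(sorted_dests):
    """Given destinations in sorted order, return (flags, priorities, D)."""
    # Flag is 1 at the first message of each group of equal destinations.
    flags = [1 if i == 0 or sorted_dests[i] != sorted_dests[i - 1] else 0
             for i in range(len(sorted_dests))]
    # Segmented addition prefix: restart the count at every flagged position.
    priorities = []
    for flag in flags:
        priorities.append(1 if flag else priorities[-1] + 1)
    # Unsegmented prefix with max as the operator yields D.
    D = max(priorities, default=0)
    return flags, priorities, D


flags, priorities, D = assign_priorities([2, 2, 2, 5, 7, 7])
```

In this example three messages share destination 2, so they receive priorities 1, 2, and 3 and are routed in three different stages.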

In the qth stage, $1 \leq q \leq D$, all non-dummy messages with priority q are routed to their
destination site on the hypercube as follows:

a) one processor is assigned to each message;

b) each such processor computes the gap, the difference between the destination and current
site of its message;

c) each gap g is represented as a binary d-tuple $g = (g_{d-1}, \ldots, g_0)$;

d) for $t = d - 1, d - 2, \ldots, 0$, those messages whose gap contains $2^t$ are sent to the site
reached by crossing the tth dimension of the hypercube.
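One stage of this procedure can be traced with the sketch below. It makes several simplifying assumptions: the messages of the stage occupy sites 0 to k-1 in sorted order, their destinations are distinct, and each move is modeled as an advance of $2^t$ positions rather than as individual hypercube edge traversals. The function name `route_stage` and the destination list are illustrative.

```python
def route_stage(dests, d):
    """Advance messages from sites 0..k-1 to dests, largest power of two first.

    Returns the list of site assignments after each value of t, starting with
    the initial placement. Raises AssertionError if two messages ever collide.
    """
    position = list(range(len(dests)))
    gaps = [dest - site for site, dest in enumerate(dests)]
    history = [list(position)]
    for t in range(d - 1, -1, -1):
        for m, gap in enumerate(gaps):
            if (gap >> t) & 1:             # the gap "contains" 2**t
                position[m] += 1 << t
        assert len(set(position)) == len(position)   # no two messages share a site
        history.append(list(position))
    return history


dests = [1, 2, 4, 5, 7, 11, 13, 15]        # illustrative sorted destinations
history = route_stage(dests, 4)
```

The final entry of `history` equals `dests`, and the assertion inside the loop confirms that no site holds two messages after any of the d moves.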

We show that in at most $O(D \log p)$ steps all messages are routed to their destinations.
Let the sorted message sites form an ascending sequence. If there are k non-dummy messages,
let $gap_i$, $0 \leq i \leq k - 1$, be the gap of the ith message. Observe that these gaps must also
form a nondecreasing sequence. For example, shown below is a sorted set of destinations and
a corresponding sequence of gaps:

gap_i    1   1   2   2   3   6   7   8
dest_i   1   2   4   5   7  11  13  15
i        0   1   2   3   4   5   6   7

All the messages whose gaps contain $2^{d-1}$ must be the last messages in the sequence because
the gaps would otherwise be out of order. Thus, advancing messages with these gaps by
$2^{d-1}$ positions, which is done by moving them across the largest dimension of the hypercube,
advances them to positions in the sequence that cannot be occupied by any other messages,
even after these messages have been advanced by their full gaps. For example, shown below are
the positions of the messages given above after those whose gaps contain 8 and 4 have been
moved by this many positions:



dest_i   1   2   4   5   7   -   -   -   -  11  13   -   -   -   -  15
i        0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15



Repeating this argument on subsequent smaller powers of 2, we find that no two messages 
that are routed in a given stage occupy the same site. As a consequence, after D stages, each 
taking d steps, all messages are routed. We summarize this result below. 

THEOREM 7.9.4 Each computation cycle of a p-processor EREW PRAM can be simulated by a
normal algorithm on a p-vertex hypercube in $O(D \log p + \log^2 p)$ steps, where D is the maximum
number of processors accessing memory locations stored at a given vertex of the hypercube.

This result can be improved to $O(\log p)$ [158] with a probabilistic algorithm that replicates
each datum at each hypercube processor a fixed number of times.

Because the simulation described above of an EREW PRAM on a hypercube consists of a
fixed number of normal steps and fully normal sequences of steps, $O(D\sqrt{p})$- and $O(Dp)$-time
simulations of a PRAM on two-dimensional meshes and linear arrays follow. (See Problems
7.32 and 7.33.)

7.9.4 Circuits and the CREW PRAM 

Algebraic and logic circuits can also be simulated on PRAMs, in particular the CREW PRAM.
For simplicity we assign one processor to each vertex of a circuit (a gate). We also assume that
each vertex has bounded fan-in, which for concreteness is assumed to be 2. We also reserve one
memory location for each gate and one for each input variable. Each processor now alternates
between reading values from its two inputs (concurrently with other processors, if necessary)
and exclusively writing values to the location reserved for its value. Two steps are devoted to
reading the values of gate inputs. Let $D_\Omega(f)$ be the depth of the circuit for a function $f$. After
$2D_\Omega(f)$ steps the input values have propagated to the output gates, the values computed by
them are correct, and the computation is complete.
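The simulation can be sketched as follows. The function name `crew_circuit`, the gate encoding, and the small three-gate example are assumptions for illustration; each pass of the loop corresponds to two read steps (one per gate input) followed by a write in which each processor updates only its own gate's location.

```python
def crew_circuit(gates, inputs, depth):
    """Simulate a fan-in-2 circuit with one CREW PRAM processor per gate.

    gates: list of (op, source1, source2); a source is ("x", i) for input
    variable i or ("g", i) for the output of gate i.
    Returns the contents of the gate locations after `depth` rounds.
    """
    ops = {
        "AND": lambda a, b: a & b,
        "OR": lambda a, b: a | b,
        "XOR": lambda a, b: a ^ b,
    }
    cells = [0] * len(gates)               # one common-memory location per gate

    def read(source):
        kind, index = source
        return inputs[index] if kind == "x" else cells[index]

    for _ in range(depth):
        # Read steps: several processors may read the same location (concurrent read).
        fetched = [(read(s1), read(s2)) for _, s1, s2 in gates]
        # Write step: processor g writes only location g (exclusive write).
        cells = [ops[op](a, b) for (op, _, _), (a, b) in zip(gates, fetched)]
    return cells


# g0 = x0 AND x1, g1 = x0 OR x1, g2 = g0 XOR g1; the circuit has depth 2.
gates = [("AND", ("x", 0), ("x", 1)),
         ("OR", ("x", 0), ("x", 1)),
         ("XOR", ("g", 0), ("g", 1))]
outputs = [crew_circuit(gates, (a, b), 2)[2] for a in (0, 1) for b in (0, 1)]
```

Gates at depth t hold correct values after round t and keep them in later rounds, so running for the circuit depth is sufficient.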

In Section 8.14 we show a stronger result, that CREW PRAMs and circuits are equivalent 
as language recognizers. We also explore the parallel computation thesis, which states that 
sequential space and parallel time are polynomially related. It follows that the PRAM and the 
logic circuit are both excellent models in terms of which to measure the minimal computation 
time required for a problem on a parallel machine. In Section 8.15 we exhibit complexity 
classes, that is, classes of languages defined in terms of the depth of circuits recognizing them. 



7.10 The BSP and LogP Models 



Bulk synchronous parallelism (BSP) extends the MIMD model to potentially different
asynchronous programs running on the physical processors of a parallel computer. Its developers
believe that the BSP model is both built on realistic assumptions and sufficiently simple to
provide an attractive model for programming parallel computers. They expect it will play a
role similar to that of the RAM for serial computation, that is, that programs written for the
BSP model can be translated into efficient code for a variety of parallel machines.

The BSP model explicitly assumes that a) computations are divided into supersteps, b) all
processors are synchronized after each superstep, c) processors can send and receive messages
to and from all other processors, d) message transmission is non-blocking (computation can
resume after sending a message), and e) all messages are delivered by the end of a superstep.
The important parameters of this model are p, the number of processors, s, the speed of each
processor, l, the latency of the system, which is the number of processor steps to synchronize
processors, and g, the additional number of processor steps per word to deliver a message.
Here g measures the time per word to transmit a message between processors after the path
between them has been set up; l measures the time to set up paths between processors and/or
to synchronize all p processors. Each of these parameters must be appraised under "normal"
computational and communication loads if the model is to provide useful estimates of the time
to complete a task.

For the BSP model to be effective, it must be possible to keep the processors busy while 
waiting for communications to be completed. If the latency of the network is too high, this 
will not be possible. It will also not be possible if algorithms are not designed properly. For 
example, if all processors attempt to send messages to a single processor, network congestion 
will prevent the messages from being answered quickly. It has been shown that for many 
important problems data can be distributed and algorithms designed to make good use of the 
BSP model [348]. It should also be noted that the BSP model is not effective on problems that
are not parallelizable, such as may be the case for P-complete problems (see Section 8.9). 

Although for many problems and machines the BSP model is a good one, it does not 
take into account network congestion due to the number of messages in transit. The LogP 
model extends the BSP model by explicitly accounting for the overhead time (the o in LogP) 
to prepare a message for transmission. The model is also characterized by the parameters L, g, 
and P that have the same meaning as the parameters L, g, and p in the BSP model. The LogP 
and BSP models are about equally good at predicting algorithm performance. 

Many other models have been proposed to capture one aspect or another of practical par- 
allel computation. Chapter 11 discusses some of the parallel I/O issues. 



Problems 

PARALLEL COMPUTERS WITH MEMORY 

7.1 Consider the design of a bus arbitration sequential circuit for a computer containing 
four CPUs. This circuit has four Boolean inputs and outputs, one per CPU. A CPU 
requesting bus access sets its input to 1 and waits until its output is set to 1, after which 
it puts its word and destination address on the bus. CPUs not requesting bus access set 
their bus arbitration input variable to 0. 

At the beginning of each cycle the bus arbitration circuit reads the input variables and, 
if at least one of them has value 1, sets one output variable to 1. If all input variables 
are 0, it sets all output variables to 0. 

Design two such arbitration circuits, one that grants priority to the lowest indexed 
input that is 1 and a second that grants priority alternately to the lowest and highest 
indexed input if more than one input variable is 1. 

Figure 7.22 A four-by-four mesh-of-trees network. 



7.2 Sketch a data-parallel program that operates on a sorted list of keys and finds the largest 
number of times that a key is repeated. 

7.3 Sketch a data-parallel program to find the last record in a linked list where initially each 
record contains the address of the next item in the list (except for the last item, whose 
next address is null). 

Hint: Assign one processor to each list item and assume that accesses to two or more 
distinct addresses can be done simultaneously. 

7.4 The n x n mesh-of-trees network, $n = 2^r$, is formed from an n x n mesh by replacing 
each linear connection forming a row or column by a balanced binary tree. (See 
Fig. 7.22.) Let the entries of two n x n matrices be uniformly distributed on the vertices 
of the original mesh. Give an efficient matrix multiplication algorithm on this network and 
determine its running time. 

7.5 Identify problems that arise in a crossbar network when more than one source wishes 
to connect to the same destination. Describe how to ensure that only one source is 
connected to one destination at the same time. 

THE PERFORMANCE OF PARALLEL ALGORITHMS 

7.6 Describe how you might apply Amdahl's Law to a data-parallel program to estimate its 
running time. 

7.7 Consider the evaluation of the polynomial $p(x) = a_n x^n + a_{n-1} x^{n-1} + \cdots + a_1 x + a_0$ 
on a p-processor shared-memory machine. Sketch an algorithm whose running time is 
$O(n/p + \log n)$ for this problem. 

LINEAR ARRAYS 

7.8 Generalize the example of Section 7.5.1 to show that the product of an n x n matrix 
and an n-vector can be realized in 3n - 1 steps on a linear systolic array. 

7.9 Show that every algorithm on a linear array to compute the product of an n x n matrix 
and an n-vector requires at least n steps. Assume that components of the matrix and 
vector enter cells individually. 

7.10 Design an algorithm for a linear array of length O(n) that convolves two sequences 
each of length n in O(n) steps. Show that no substantially faster algorithm for such a 
linear array exists. 

MULTIDIMENSIONAL ARRAYS 

7.11 Show that at most $u(d) = 2d^2 + 2d + 1$ cells are at most d edges away from any cell 
in a two-dimensional systolic array. 

7.12 Derive an expression for the distance between vertices $(n_1, n_2, \ldots, n_d)$ and 
$(m_1, m_2, \ldots, m_d)$ in a d-dimensional toroidal mesh and determine the maximum distance 
between two such vertices. 

7.13 Design efficient algorithms to multiply two n x n matrices on a k x k mesh, k < n. 

HYPERCUBE-BASED MACHINES 

7.14 Show that the vertices of the $2^d$-input FFT graph can be numbered so that edges 
between levels correspond to swaps across the dimensions of a d-dimensional hypercube. 

7.15 Show that the convolution function $f_{\mathrm{conv}} : \mathcal{R}^{n+m} \mapsto \mathcal{R}^{n+m-1}$ over a commutative 
ring $\mathcal{R}$ can be implemented by a fully normal algorithm in time O(log n). 

7.16 Prove that the unshuffle operation on a linear array of $n = 2^d$ cells can be done with 
$2^{d-1} - 1$ comparison/exchange steps. 

7.17 Prove that the algorithm described in Section 7.7.4 to simulate a normal hypercube 
algorithm on a linear array of $n = 2^d$ elements correctly places into exchange locations 
elements whose indices differ by successive powers of 2. 

7.18 Describe an efficient algorithm for a linear array that merges two sorted sequences of 
the same length. 

7.19 Show that Batcher's sorting algorithm based on bitonic merging can be realized on a 
p-vertex hypercube by a normal algorithm in $O(\log^2 p)$ steps. 

7.20 Show that Batcher's sorting algorithm based on bitonic merging can be realized on a 
linear array of $n = 2^d$ cells in O(n) steps. 

7.21 Show that Batcher's sorting algorithm based on bitonic merging can be realized on a 
$\sqrt{n} \times \sqrt{n}$ array in $O(\sqrt{n})$ steps. 

7.22 Design an $O(\sqrt{n})$-step algorithm to implement an arbitrary permutation of n items 
placed one per cell of a $\sqrt{n} \times \sqrt{n}$ mesh. 

7.23 Describe a normal algorithm to realize a prefix computation on a p-vertex hypercube in 
O(log p) steps. 

7.24 Design an algorithm to perform a prefix computation on a $\sqrt{n} \times \sqrt{n}$ mesh in $3\sqrt{n}$ 
steps. Show that no other algorithm for this problem on this mesh has substantially 
better performance. 

ROUTING IN NETWORKS 

7.25 Give a complete description of a procedure to set up the switches in a Benes network. 

7.26 Show how to perform an arbitrary permutation on a linear array. 

THE PRAM MODEL 

7.27 a) Design an O(1)-step CRCW PRAM algorithm to find the maximum element in a 
list. 

b) Design an O(log log n)-step CRCW PRAM algorithm to find the maximum element 
in a list that uses O(n) processors. 

Hint: Construct a tree in which the root and every other vertex has a number of 
immediate descendants that is about equal to the square root of the number of leaves 
that are its descendants. 

7.28 The goal of the list-ranking problem is to assign a rank to each record in a linked 
list; the rank of a record is its position relative to the last element in the list where the 
last element has rank zero. Each record has two fields, one for its rank and another for 
the address of its successor record. The address field of the last record contains its own 
address. 

Describe an efficient p-processor EREW PRAM algorithm to solve the list-ranking 
problem for a list of p items stored one per location in the common memory. 

Hint: Use pointer doubling in which each address is replaced by the address of its 
current successor. 

7.29 Consider an n-vertex directed graph in which each vertex knows the address of its 
parent and the roots have themselves as parents. Under the assumption that each vertex 
is placed in a unique cell in a common PRAM memory, show that the roots can be 
found in O(log n) steps. 

7.30 Design an efficient PRAM algorithm to find the item in a list that occurs most often. 

7.31 Figure 7.23 shows two trees containing one and three copies of a computational 
element, respectively. This element accepts three inputs and produces three outputs using 
⊙, an associative operator. Tree (a) accepts a, b, and c as input and produces a, a ⊙ b, 
and b ⊙ c as output. Tree (b) accepts a, b, c, d, and e as input and produces a, a ⊙ b, 
a ⊙ b ⊙ c, a ⊙ b ⊙ c ⊙ d, and b ⊙ c ⊙ d ⊙ e as output. If the input and output at the root 
of the trees are combined with ⊙, the output of each tree is the prefix computation on 
its inputs. 

Generalize the constructions of Fig. 7.23 to produce a circuit for the prefix function on 
n inputs, n arbitrary. Give a convincing argument that your construction is correct and 
derive good upper bounds on the size and depth of your circuit. Show that to within 
multiplicative factors your construction has minimal size and depth. 

Figure 7.23 Components of an efficient prefix circuit: trees (a) and (b). 



7.32 Show that each computation cycle of a p-processor EREW PRAM can be simulated on 
a $\sqrt{p} \times \sqrt{p}$ mesh in $O(D\sqrt{p})$ steps, where D is the maximum number of processors 
accessing memory locations stored at a given vertex of the mesh. 

7.33 Show that each computation cycle of a p-processor EREW PRAM can be simulated 
on a p-processor linear array in O(Dp) steps, where D is the maximum number of 
processors accessing memory locations stored at a given vertex of the array. 

THE BSP AND LOGP MODELS 

7.34 Design an algorithm for the p-processor BSP and/or LogP models to multiply two n x n 
matrices when each matrix entry occurs once and entries are uniformly distributed over 
the p processors. Given the parameters of the models, determine for which values of n 
your algorithm is efficient. 

Hint: The performance of your algorithm will be dependent on the initial placement 
of data. 

7.35 Design an algorithm for the p-processor BSP and/or LogP models for the segmented 
prefix function. Given the parameters of the models, determine for which values of n 
your algorithm is efficient. 



Chapter Notes 



A discussion of parallel algorithms and architectures up to about 1980 can be found in the book 
by Hockney and Jesshope [135]. A number of recent textbooks provide extensive coverage of 
parallel algorithms and architectures. They include the books by Akl [16], Bertsekas and 
Tsitsiklis [38], Gibbons and Spirakis [113], JáJá [148], Leighton [192], Quinn [265], and Reif 
[277]. In addition, the survey article by Karp and Ramachandran [161] gives an overview 
of parallel algorithmic methods. References to results on circuit complexity can be found in 
Chapters 2, 6, and 9. 

Flynn introduced the taxonomy of parallel computers that carries his name [102]. The 
data-parallel style of computing was anticipated in the APL [146] and FP programming lan- 
guages [26] as well as by Preparata and Vuillemin [262] in their study of parallel algorithms 
for networked machines. It was developed as the style of choice for programming the Connec- 
tion Machine [133]. (See also the books by Hatcher and Quinn [129] and Blelloch [45] on 
data-parallel computing.) The simulation of the MIMD computer by a SIMD one given in 
Section 7.3.1 is due to Wloka [365]. 

Amdahl's Law [21] and Brent's principle [58] are widely cited; the latter is used extensively 
to design efficient parallel algorithms. 

Systolic algorithms for convolution, matrix multiplication, and the fast Fourier transform 
are given by Kung and Leiserson [180] (see also [181]). Odd-even transposition sort is de- 
scribed by Knuth [170]. The lower bound on the time to multiply two matrices given in 
Theorem 7.5.3 is due to Gentleman [112]. The shuffle network was introduced by Stone 
[318]. 

Preparata and Vuillemin [262] give normal algorithms for a variety of problems (including 
that for shifting in Section 7.7.3) and introduce the cube-connected cycles machine. They also 
give embeddings of fully normal algorithms into linear arrays and meshes. Dekel, Nassimi, and 
Sahni [85] developed the fast algorithm for matrix multiplication on the hypercube described 
in Section 7.7.7. 

Batcher [29] introduced odd-even and bitonic sorting methods and noted that they could 
be used for routing messages in networks. Benes [36] is the author of the Benes permutation 
network. 

Variants of the PRAM were introduced by Fortune and Wyllie [103], Goldschlager [118], 
Savitch and Stimson [298] as generalizations of the idealized RAM model of Cook and Reck- 
how [77]. The method given in Theorem 7.9.3 to simulate a CRCW PRAM on an EREW 
PRAM is due to Eckstein [95] and Vishkin [353]. Simulations of PRAMs on networked com- 
puters have been developed by Mehlhorn and Vishkin [221], Upfal [340], Upfal and Wigder- 
son [341], Karlin and Upfal [158], Alt, Hagerup, Mehlhorn, and Preparata [19], and Ranade 
[267]. Cypher and Plaxton [84] have developed a deterministic O(log p log log p)-step 
sorting algorithm for the hypercube. However, it is superior to Batcher's algorithm only for very 
large and impractical values of p. 

The bulk synchronous parallel (BSP) model [348] has been proposed as a bridging model 
between the needs of programmers and parallel machines. The LogP model [83] is offered as 
a more realistic variant of the BSP model. Juurlink and Wijshoff [154] and Bilardi, Herley, 
Pietracaprina, Pucci, and Spirakis [39] report empirical evidence that the BSP and LogP models 
are about equally good as predictors of performance on real parallel computers. 



Part III 

COMPUTATIONAL 
COMPLEXITY 




Complexity Classes 



In an ideal world, each computational problem would be classified at least approximately by its 
use of computational resources. Unfortunately, our ability to so classify some important prob- 
lems is limited. We must be content to show that such problems fall into general complexity 
classes, such as the polynomial-time problems P, problems whose running time on a determin- 
istic Turing machine is a polynomial in the length of its input, or NP, the polynomial-time 
problems on nondeterministic Turing machines. 

Many complexity classes contain "complete problems," problems that are hardest in the 
class. If the complexity of one complete problem is known, that of all complete problems is 
known. Thus, it is very useful to know that a problem is complete for a particular complexity 
class. For example, the class of NP-complete problems, the hardest problems in NP, contains 
many hundreds of important combinatorial problems such as the Traveling Salesperson Prob- 
lem. It is known that each NP-complete problem can be solved in time exponential in the size 
of the problem, but it is not known whether they can be solved in polynomial time. Whether 
P and NP are equal or not is known as the P = NP question. Decades of research have been 
devoted to this question without success. As a consequence, knowing that a problem is NP- 
complete is good evidence that it is an exponential-time problem. On the other hand, if one 
such problem were shown to be in P, all such problems would have been shown to be in P, a 
result that would be most important. 

In this chapter we classify problems by the resources they use on serial and parallel ma- 
chines. The serial models are the Turing and random-access machines. The parallel models 
are the circuit and the parallel random-access machine (PRAM). We begin with a discussion 
of tasks, machine models, and resource measures, after which we examine serial complexity 
classes and relationships among them. Complete problems are defined and the P-complete, 
NP-complete, and PSPACE-complete problems are examined. We then turn to the PRAM 
and circuit models and conclude by identifying important circuit complexity classes such as 
NC and P/poly. 

8.1 Introduction 

The classification of problems requires a precise definition of those problems and the com- 
putational models used. Problems are accurately classified only when we are sure that they 
have been well defined and that the computational models against which they are classified are 
representative of the computational environment in which these problems will be solved. This 
requires the computational models to be general. On the other hand, to be useful, problem 
classifications should not be overly dependent on the characteristics of the machine model used 
for classification purposes. For example, because of the obviously inefficient use of memory on 
the Turing machine, the set of problems that runs in time linear in the length of their input on 
a random-access machine is likely to be different from the set that runs in linear time on the 
Turing machine. On the other hand, the set of problems that run in polynomial time on both 
machines is the same. 



8.2 Languages and Problems 



Before formally defining decision problems, a major topic of this chapter, we give two examples 
of them, SATISFIABILITY and UNSATISFIABILITY. A set of clauses is satisfiable if values can 
be assigned to Boolean variables in these clauses such that each clause has at least one literal 
with value 1. 

SATISFIABILITY 

Instance: A set of literals $X = \{x_1, \bar{x}_1, x_2, \bar{x}_2, \ldots, x_n, \bar{x}_n\}$, and a sequence of clauses 
$C = (c_1, c_2, \ldots, c_m)$ where each clause $c_i$ is a subset of X. 

Answer: "Yes" if for some assignment of Boolean values to variables in $\{x_1, x_2, \ldots, x_n\}$, at 
least one literal in each clause has value 1. 

The complement of the decision problem SATISFIABILITY, UNSATISFIABILITY, is defined 
below. 

UNSATISFIABILITY 

Instance: A set of literals $X = \{x_1, \bar{x}_1, x_2, \bar{x}_2, \ldots, x_n, \bar{x}_n\}$, and a sequence of clauses 
$C = (c_1, c_2, \ldots, c_m)$ where each clause $c_i$ is a subset of X. 

Answer: "Yes" if for all assignments of Boolean values to variables in $\{x_1, x_2, \ldots, x_n\}$, all 
literals in at least one clause have value 0. 

The clauses $C_1 = (\{\bar{x}_1, x_2, x_3\}, \{x_1, \bar{x}_2\}, \{x_2, x_3\})$ are satisfied with $x_1 = x_2 = x_3 = 1$, 
whereas the clauses $C_2 = (\{x_1, x_2, x_3\}, \{\bar{x}_1, x_2\}, \{\bar{x}_2, x_3\}, \{\bar{x}_3, x_1\}, \{\bar{x}_1, \bar{x}_2, \bar{x}_3\})$ are not 
satisfiable. SATISFIABILITY consists of collections of satisfiable clauses. $C_1$ is in SATISFIABILITY. 
The complement of SATISFIABILITY, UNSATISFIABILITY, consists of instances of clauses 
not all of which can be satisfied. $C_2$ is in UNSATISFIABILITY. 

We now introduce terminology used to classify problems. This terminology and the asso- 
ciated concepts are used throughout this chapter. 

DEFINITION 8.2.1 Let $\Sigma$ be an arbitrary finite alphabet. A decision problem $\mathcal{P}$ is defined by a 
set of instances $I \subseteq \Sigma^*$ of the problem and a condition $\phi_{\mathcal{P}} : I \mapsto \mathcal{B}$ that has value 1 on "Yes" 
instances and 0 on "No" instances. Then $I_{yes} = \{w \in I \mid \phi_{\mathcal{P}}(w) = 1\}$ are the "Yes" instances. 
The "No" instances are $I_{no} = I - I_{yes}$. 

The complement of a decision problem $\mathcal{P}$, denoted $\mathrm{co}\mathcal{P}$, is the decision problem in which 
the "Yes" instances of $\mathrm{co}\mathcal{P}$ are the "No" instances of $\mathcal{P}$ and vice versa. 

The "Yes" instances of a decision problem are encoded as binary strings by an encoding 
function $\sigma : \Sigma^* \mapsto \mathcal{B}^*$ that assigns to each $w \in I$ a string $\sigma(w) \in \mathcal{B}^*$. 

With respect to $\sigma$, the language $L(\mathcal{P})$ associated with a decision problem $\mathcal{P}$ is the set 
$L(\mathcal{P}) = \{\sigma(w) \mid w \in I_{yes}\}$. With respect to $\sigma$, the language $L(\mathrm{co}\mathcal{P})$ associated with $\mathrm{co}\mathcal{P}$ is the 
set $L(\mathrm{co}\mathcal{P}) = \{\sigma(w) \mid w \in I_{no}\}$. 

The complement of a language L, denoted $\overline{L}$, is $\mathcal{B}^* - L$; that is, $\overline{L}$ consists of the strings 
that are not in L. 

A decision problem can be generalized to a problem $\mathcal{P}$ characterized by a function $f : \mathcal{B}^* \mapsto \mathcal{B}^*$ 
described by a set of ordered pairs (x, f(x)), where each string $x \in \mathcal{B}^*$ appears once as the 
left-hand side of a pair. Thus, a language is defined by problems $f : \mathcal{B}^* \mapsto \mathcal{B}$ and consists of the 
strings on which f has value 1. 

SATISFIABILITY and all other decision problems in NP have succinct "certificates" for 
"Yes" instances, that is, choices on a nondeterministic Turing machine that lead to acceptance 
of a "Yes" instance in a number of steps that is a polynomial in the length of the instance. A 
certificate for an instance of SATISFIABILITY consists of values for the variables of the instance 
on which each clause has at least one literal with value 1. The verification of such a certificate 
can be done on a Turing machine in a number of steps that is quadratic in the length of the 
input. (See Problem 8.3.) 
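
Certificate verification for SATISFIABILITY can be sketched in a few lines. The representation below is an assumption made for illustration: the literal x_i is the integer +i, its complement is -i, a clause is a set of such integers, and a certificate is a dictionary from variable index to 0 or 1. The function name `verify_certificate` is hypothetical. The sketch makes one pass over the clauses on a random-access model; it does not reproduce the quadratic Turing machine bound cited above.

```python
def verify_certificate(clauses, certificate):
    """Return True if every clause has at least one literal with value 1.

    clauses     -- list of sets of nonzero ints (+i is x_i, -i is its complement)
    certificate -- dict mapping each variable index i to 0 or 1
    """
    for clause in clauses:
        # A positive literal has value 1 when its variable is 1;
        # a complemented literal has value 1 when its variable is 0.
        if not any((certificate[abs(lit)] == 1) == (lit > 0) for lit in clause):
            return False
    return True
```

For instance, with the clauses {-1, 2, 3}, {1, -2}, {2, 3}, the all-ones assignment is accepted, while the assignment x_1 = 0, x_2 = 1, x_3 = 0 is rejected because the second clause has no literal with value 1.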

Similarly, UNSATISFIABILITY and all other decision problems in coNP can be disqualified 
quickly; that is, their "No" instances can be "disqualified" quickly by exhibiting certificates for 
them (which are certificates for the "Yes" instance of the complementary decision problem). 
For example, a disqualification for UNSATISFIABILITY is a satisfying assignment for a "No" 
instance, that is, a satisfiable set of clauses. 

It is not known how to identify a certificate for a "Yes" instance of SATISFIABILITY or any 
other NP-complete problem in time polynomial in the length of the instance. If a "Yes" instance 
has n variables, an exhaustive search of the $2^n$ values for the n variables is about the best general 
method known to find an answer. 
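
The exhaustive search mentioned above can be sketched as follows. This is an illustration under assumed conventions rather than a method from the text: literals are signed integers (+i for x_i, -i for its complement), clauses are sets of literals, and the hypothetical function `exhaustive_search` also reports how many of the 2^n assignments it examined.

```python
from itertools import product


def exhaustive_search(clauses, n):
    """Try the 2**n assignments in order; return (assignment, number_tried).

    The assignment is a tuple whose entry i-1 is the value of x_i, or None
    if no assignment gives every clause a literal with value 1.
    """
    tried = 0
    for bits in product((0, 1), repeat=n):
        tried += 1
        if all(
            any(bits[abs(lit) - 1] == (1 if lit > 0 else 0) for lit in clause)
            for clause in clauses
        ):
            return bits, tried
    return None, tried
```

On an unsatisfiable instance the loop necessarily visits all 2^n assignments, which is what makes this approach impractical as n grows.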

8.2.1 Complements of Languages and Decision Problems 

There are many ways to encode problem instances. For example, for SATISFIABILITY we 
might represent $x_i$ as i and $\bar{x}_i$ as ~i and then use the standard seven-bit ASCII encodings for 
characters. Then we would translate the clause $\{x_4, \bar{x}_7\}$ into {4,~7} and then represent it as 
123 052 044 126 055 125, where each number is a decimal representing a binary 7-tuple and 
4, comma, and ~ are represented by 052, 044, and 126, respectively, for example. 
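
The encoding just described is easy to reproduce. The following sketch (the function names `encode_clause` and `to_binary` are hypothetical) writes each character of the clause string as a three-digit decimal ASCII code and, alternatively, as the binary 7-tuple that the decimal number stands for.

```python
def encode_clause(text):
    """Return the decimal ASCII codes of text, three digits per character."""
    return " ".join(f"{ord(ch):03d}" for ch in text)


def to_binary(text):
    """Return the concatenated seven-bit ASCII encodings of the characters."""
    return "".join(format(ord(ch), "07b") for ch in text)


print(encode_clause("{4,~7}"))   # decimal codes, one per character
print(to_binary("{4,~7}"))       # 6 characters x 7 bits = 42 bits
```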

All the instances I of decision problems $\mathcal{P}$ considered in this chapter are characterized 
by regular expressions. In addition, the encoding function of Definition 8.2.1 can be chosen 
to map strings in I to binary strings $\sigma(I)$ describable by regular expressions. Thus, a 
finite-state machine can be used to determine if a binary string is in $\sigma(I)$ or not. We assume that 
membership of a string in $\sigma(I)$ can be determined efficiently. 

As suggested by Fig. 8.1, the strings in $\overline{L(\mathcal{P})}$, the complement of $L(\mathcal{P})$, are either strings 
in $L(\mathrm{co}\mathcal{P})$ or strings in $\sigma(\Sigma^* - I)$. Since testing of membership in $\sigma(\Sigma^* - I)$ is easy, testing 
for membership in $\overline{L(\mathcal{P})}$ and $L(\mathrm{co}\mathcal{P})$ requires about the same space and time. For this reason, 
we often equate the two when discussing the complements of languages. 

Figure 8.1 The language $L(\mathcal{P})$ of a decision problem $\mathcal{P}$ and the language of its complement 
$L(\mathrm{co}\mathcal{P})$. The languages $L(\mathcal{P})$ and $L(\mathrm{co}\mathcal{P})$ encode all instances in I. The complement of $L(\mathcal{P})$, 
$\overline{L(\mathcal{P})}$, is the union of $L(\mathrm{co}\mathcal{P})$ with $\sigma(\Sigma^* - I)$, strings that are in neither $L(\mathcal{P})$ nor $L(\mathrm{co}\mathcal{P})$. 



8.3 Resource Bounds 

One of the most important problems in computer science is the identification of the 
computationally feasible problems. Currently a problem is considered feasible if its running time on a 
DTM (deterministic Turing machine) is polynomial. (Stated by Edmonds [96], this is known 
as the serial computation thesis.) Note, however, that some polynomial running times, such 
as $n^{1000}$, where n is the length of a problem instance, can be enormous. In this case doubling 
n increases the time bound by a factor of $2^{1000}$, which is approximately $10^{301}$! 
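
A quick calculation shows how large such a factor can be. The sketch below assumes, purely for illustration, a polynomial bound of the form n^k with k = 1000; doubling n then multiplies the bound by (2n)^k / n^k = 2^k, whatever n is. The name `doubling_factor` is hypothetical.

```python
def doubling_factor(k):
    """Factor by which n**k grows when n is doubled: (2n)**k / n**k = 2**k."""
    return 2 ** k


factor = doubling_factor(1000)   # exact, thanks to Python's big integers
digits = len(str(factor))        # number of decimal digits of 2**1000
print(digits)
```

Because Python integers are unbounded, the factor is computed exactly and its decimal length can be read off directly.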

Since problems are classified by their use of resources, we need to be precise about resource 
bounds. These are functions $r : \mathbb{N} \mapsto \mathbb{N}$ from the natural numbers $\mathbb{N} = \{0, 1, 2, 3, \ldots\}$ to 
the natural numbers. The resource functions used in this chapter are: 



    Logarithmic function         $r(n) = O(\log n)$ 
    Poly-logarithmic function    $r(n) = \log^{O(1)} n$ 
    Linear function              $r(n) = O(n)$ 
    Polynomial function          $r(n) = n^{O(1)}$ 
    Exponential function         $r(n) = 2^{n^{O(1)}}$ 



A resource function that grows faster than any polynomial is called a superpolynomial 
function. For example, the function $f(n) = 2^{\log^2 n}$ grows faster than any polynomial (the ratio 
$\log f(n)/\log n$ is unbounded) but more slowly than any exponential (for any k > 0 the ratio 
$(\log^2 n)/n^k$ becomes vanishingly small with increasing n). 

Another note of caution is appropriate here when comparing resource functions. Even 
though one function, r(n), may grow more slowly asymptotically than another, s(n), it may 
still be true that r(n) > s(n) for very large values of n. For example, $r(n) = 10 \log^4 n >
s(n) = n$ for n < 1,889,750 despite the fact that r(n) is much smaller than s(n) for large n. 
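
The crossover point in such a comparison can be located numerically. The sketch below assumes the functions being compared are r(n) = 10 (log_2 n)^4 and s(n) = n; both the fourth power and the base-2 logarithm are assumptions chosen because they place the crossover near the value quoted above. The helper names `r` and `crossover` are hypothetical.

```python
import math


def r(n):
    """Assumed slow-growing resource function: 10 * (log2 n)**4."""
    return 10 * math.log2(n) ** 4


def crossover(lo, hi):
    """Binary search for the largest n in [lo, hi) with r(n) > n.

    Requires r(lo) > lo and r(hi) <= hi, and assumes r(n) - n changes sign
    only once on the interval.
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if r(mid) > mid:
            lo = mid
        else:
            hi = mid
    return lo


print(crossover(1_000_000, 3_000_000))
```

Below the crossover the asymptotically smaller function is the larger of the two, which is the point of the caution above.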

Some resource functions are so complex that they cannot be computed in the time or space 
that they define. For this reason we assume throughout this chapter that all resource functions 
are proper. (Definitions of time and space on Turing machines are given in Section 8.4.2.) 

DEFINITION 8.3.1 A function $r : \mathbb{N} \mapsto \mathbb{N}$ is proper if it is nondecreasing ($r(n+1) \ge r(n)$) 
and for some tape symbol a there is a deterministic multi-tape Turing machine M that, on all 
inputs of length n in time O(n + r(n)) and temporary space r(n), writes the string $a^{r(n)}$ (unary 
notation for r(n)) on one of its tapes and halts. 

Thus, if a resource function r(n) is proper, there is a DTM, $M_r$, that given an input of length 
n can write r(n) markers on one of its tapes within time O(n + r(n)) and space r(n). Another 
DTM, M, can use a copy of $M_r$ to mark r(n) squares on a tape that can be used to stop M 
after exactly Kr(n) steps for some constant K. The resource function can also be used to 
ensure that M uses no more than Kr(n) cells on its work tapes. 
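
The idea of using marked squares as a clock can be mimicked in software. The sketch below is only an analogy, not a Turing machine construction: a computation is modeled as a Python iterator that yields once per step, the list `clock` plays the role of the marked tape, and the hypothetical function `run_clocked` consumes one mark per step so that it always finishes after exactly K r(n) steps, idling once the computation has halted.

```python
def run_clocked(machine_steps, n, r, K=1):
    """Run a step iterator under a clock of K * r(n) unary marks.

    Returns (steps_taken, halted): steps_taken always equals K * r(n);
    halted tells whether the computation finished within the budget.
    """
    clock = ["a"] * (K * r(n))    # unary marks, one per permitted step
    steps = 0
    halted = False
    for _ in clock:               # consume one mark per step
        if not halted:
            try:
                next(machine_steps)
            except StopIteration:
                halted = True     # computation is done; keep idling
        steps += 1
    return steps, halted
```

A computation that needs fewer steps than the budget is padded with idle steps, and one that needs more is cut off, so the total is the same on every input of length n.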



8.4 Serial Computational Models 



We consider two serial computational models in this chapter, the random-access machine 
(RAM) introduced in Section 3.4 and the Turing machine defined in Chapter 5. 

In this section we show that, up to polynomial differences in running time, the random- 
access and Turing machines are equivalent. As a consequence, if the running time of a problem 
on one machine grows at least as fast as a polynomial in the length of a problem instance, then 
it grows equally fast on the other machine. This justifies using the Turing machine as basis for 
classifying problems by their serial complexity. 

In Sections 8.13 and 8.14 we examine two parallel models of computation, the logic circuit 
and the parallel random-access machine (PRAM). 

Before beginning our discussion of models, we note that any model can be considered 
either serial or parallel. For example, a finite-state machine operating on inputs and states 
represented by many bits is a parallel machine. On the other hand, a PRAM that uses one 
simple RAM processor is serial. 

8.4.1 The Random-Access Machine 

Figure 8.2 A RAM in which the number and length of words are potentially unbounded. 

The random-access machine (RAM) is introduced in Section 3.4. (See Fig. 8.2.) In this section 
we generalize the simulation results developed in Section 3.7 by considering a RAM in which 
words are of potentially unbounded length. This RAM is assumed to have instructions for 
addition, subtraction, shifting left and right by one place, comparison of words, and Boolean 
operations of AND, OR, and NOT (the operations are performed on corresponding components 
of the source vectors), as well as conditional and unconditional jump instructions. The RAM 
also has load (and store) instructions that move words to (from) registers from (to) the random- 
access memory. Immediate and direct addressing are allowed. An immediate address contains 
a value, a direct address is the address of a value, and an indirect address is the address of 
the address of a value. (As explained in Section 3.10 and stated in Problem 3.10, indirect 
addressing does not add to the computing power of the RAM and is considered only in the 
problems.) 

The time on a RAM is the number of steps it executes. The space is the maximum number 
of bits of storage used either in the CPU or the random-access memory during a computation. 

We simplify the RAM without changing its nature by eliminating its registers, treating 
location 0 of the random-access memory as the accumulator, and using memory locations as 
registers. The RAM retains its program counter, which is incremented on each instruction 
execution (except for a jump instruction, when its value is set to the address supplied by the 
jump instruction). The word length of the RAM model is typically allowed to be unlimited, 
although in Section 3.4 we limited it to b bits. A RAM program is a finite sequence of RAM 
instructions that is stored in the random-access memory. The RAM implements the 
stored-program concept described in Section 3.4. 

In Theorem 3.8.1 we showed that a b-bit standard Turing machine (its tape alphabet 
contains $2^b$ characters) executing T steps and using S bits of storage (S/b words) can be simulated 
by the RAM described above in O(T) steps with O(S) bits of storage. Similarly, we showed 
that a b-bit RAM executing T steps and using S bits of memory can be simulated by an 
O(b)-bit standard Turing machine in O(ST log S) steps and O(S log S) bits of storage. As seen 
in Section 5.2, T-step computations on a multi-tape TM can be simulated in $O(T^2)$ steps on 
a standard Turing machine. 

If we could ensure that a RAM that executes T steps uses a highest address that is O(T) and 
generates words of fixed length, then we could use the above-mentioned simulation to establish 
that a standard Turing machine can simulate an arbitrary T-step RAM computation in time 
$O(T^2 \log T)$ and space O(S log S) measured in bits. Unfortunately, words can have length 
proportional to T (see Problem 8.4) and the highest address can be much larger than T due 
to the use of jumps. Nonetheless, a reasonably efficient polynomial-time simulation of a RAM 
computation by a DTM can be produced. Such a DTM places one (address, contents) 
pair on its tape for each RAM memory location visited by the RAM. (See Problem 8.5.) 
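
The (address, contents) representation can be sketched as a data structure. The class below is an illustrative stand-in for the tape, not the construction asked for in the problems: the hypothetical `PairTape` keeps one pair per visited location and finds an address by a linear scan, much as a tape head would, and it assumes that locations never written hold 0.

```python
class PairTape:
    """Sparse RAM memory stored as a sequence of (address, contents) pairs."""

    def __init__(self):
        self.pairs = []          # one pair per location visited so far
        self.pairs_scanned = 0   # rough measure of tape-style search cost

    def _find(self, address):
        # Linear scan, as a tape head would search for the address field.
        for i, (a, _) in enumerate(self.pairs):
            self.pairs_scanned += 1
            if a == address:
                return i
        return None

    def load(self, address):
        i = self._find(address)
        return 0 if i is None else self.pairs[i][1]   # assumed default of 0

    def store(self, address, value):
        i = self._find(address)
        if i is None:
            self.pairs.append((address, value))
        else:
            self.pairs[i] = (address, value)
```

The storage used grows with the number of locations visited rather than with the largest address, which is why very large addresses need not make the simulation expensive in space.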

We leave the proof of the following result to the reader. (See Problem 8.6.) 

THEOREM 8.4.1 Every computation on the RAM using time T can be simulated by a deterministic 
Turing machine in $O(T^3)$ steps. 

In light of the above results and since we are generally interested in problems whose time 
is polynomial in the length of the input, we use the DTM as our model of serial computation. 

8.4.2 Turing Machine Models 

The deterministic and nondeterministic Turing machines (DTM and NDTM) are discussed 
in Sections 3.7, 5.1, and 5.2. (See Fig. 8.3.) In this chapter we use multi-tape Turing machines 
to define classes of problems characterized by their use of time and space. As shown in 
Theorem 5.2.2, the general language-recognition capability of DTMs and NDTMs is the same, 
although, as we shall see, their ability to recognize languages within the same resource bounds 
is very different. 

Figure 8.3 A one-tape nondeterministic Turing machine whose control unit has an external 
choice input that disambiguates the value of its next state. 

We recognize two types of Turing machine, the standard one-tape DTM and NDTM and 
the multi-tape DTM and NDTM. The multi-tape versions are defined here to have one read- 
only input tape, one write-only output tape, and one or more work tapes. The space on these 
machines is defined to be the number of work tape cells used during a computation. This 
measure allows us to classify problems by a storage that may be less than linear in the size of 
the input. Time is the number of steps they execute. It is interesting to compare these measures 
with those for the RAM. (See Problem 8.7.) As shown in Section 5.2, we can assume without 
loss of generality that each NDTM has either one or two choices for next state for any given 
input letters and state. 

As stated in Definitions 3.7.1 and 5.1.1, a DTM M accepts the language L if and only if 
for each string in L placed left-adjusted on the otherwise blank input tape it eventually enters 
the accepting halt state. A language accepted by a DTM M is recursive if M halts on all 
inputs. Otherwise it is recursively enumerable. A DTM M computes a partial function f 
if for each input string w for which f is defined, it prints f(w) left-adjusted on its otherwise 
blank output tape. A complete function is one that is defined on all points of its domain. 

As stated in Definition 5.2.1, an NDTM M accepts the language L if for each string w in 
L placed left-adjusted on the otherwise blank input tape there is a choice input c for M that 
leads to an accepting halt state. An NDTM M computes a partial function 
$f : \mathcal{B}^* \mapsto \mathcal{B}^*$ if for each input string w for which f is defined, there is a 
sequence of moves by M that causes it to print f(w) on its output tape and enter a halt state 
and there is no choice input for which M prints an incorrect result. 
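
The role of the choice input can be illustrated with a small sketch. It is not a Turing machine: it is a finite-state control with a hypothetical transition table `DELTA` (invented here for illustration) in which each state and input letter has one or two possible next states, and a binary choice string selects among them. The input is accepted if some choice string leads to an accepting state.

```python
from itertools import product

# Hypothetical transition table: (state, letter) -> one or two possible next states.
# It accepts binary strings whose next-to-last letter is 1.
DELTA = {
    ("q0", "0"): ("q0",),
    ("q0", "1"): ("q0", "q1"),   # two choices: stay, or guess this 1 is next-to-last
    ("q1", "0"): ("q2",),
    ("q1", "1"): ("q2",),
}
ACCEPTING = {"q2"}


def run(w, choices):
    """Follow the single computation selected by the given choice string."""
    state = "q0"
    for letter, c in zip(w, choices):
        options = DELTA.get((state, letter))
        if options is None:
            return False                      # no move defined: this computation rejects
        state = options[c % len(options)]     # the choice bit picks the next state
    return state in ACCEPTING


def accepts(w):
    """Accept if at least one choice string leads to an accepting state."""
    return any(run(w, c) for c in product((0, 1), repeat=len(w)))
```

Enumerating all choice strings, as `accepts` does, is the deterministic simulation whose cost grows exponentially with the number of nondeterministic steps.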

The oracle Turing machine (OTM), the multi-tape DTM or NDTM with a special oracle 
tape, defined in Section 5.2.3, is used to classify problems. (See Problem 8.15.) Time on an 
OTM is the number of steps it takes, where one consultation of the oracle is one step, whereas 
space is the number of cells used on its work tapes not including the oracle tape. 

A precise Turing machine M is a multi-tape DTM or NDTM for which there is a function 
r(n) such that for every $n \ge 1$, every input w of length n, and every (possibly 
nondeterministic) computation by M, M halts after precisely r(n) steps. 

We now show that if a total function can be computed by a DTM, NDTM, or OTM 
within a proper time or space bound, it can be computed within approximately the same 
resource bound by a precise TM of the same type. The following theorem justifies the use of 
proper resource functions. 

THEOREM 8.4.2 Let r(n) be a proper function with $r(n) \ge n$. Let M be a multi-tape DTM, 
NDTM, or OTM with k work tapes that computes a total function f in time or space r(n). Then 
there is a constant K > 0 and a precise Turing machine of the same type that computes f in time 
and space K r(n). 

Proof Since r(n) is a proper function, there is a DTM $M_r$ that computes its value from an 
input of length n in time $K_1 r(n)$ for some constant $K_1 > 0$ and in space r(n). We design 
a precise TM $M_p$ computing the same function. 

The TM $M_p$ has an "enumeration tape" that is distinct from its work tapes. $M_p$ initially 
invokes $M_r$ to write r(n) instances of the letter a on the enumeration tape in $K_1 r(n)$ steps, 
after which it returns the head on this tape to its initial position. 

Suppose that M computes f within a time bound of r(n). $M_p$ then alternates between 
simulating one step of M on its work tapes and advancing its head on the enumeration 
tape. When M halts, $M_p$ continues to read and advance the head on its enumeration tape 
on alternate steps until it encounters a blank. Clearly, $M_p$ halts in precisely $(K_1 + 2) r(n)$ 
steps. 

Suppose now that M computes f in space r(n). $M_p$ invokes $M_r$ to write r(n) special 
blank symbols on each of its work tapes. It then simulates M, treating the special blank 
symbols as standard blanks. Thus, $M_p$ uses precisely k r(n) cells on its k work tapes. ■ 
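
The time-bounded half of this construction can be sketched in a few lines. The sketch below models only the clock: `step`, `is_halted`, and `config` are hypothetical stand-ins for the transition function, halting test, and initial configuration of M, and the cost of writing the marks on the enumeration tape is ignored. The point is that the number of ticks depends on the bound alone, not on when M halts.

```python
def run_precisely(step, is_halted, config, bound):
    """Run a simulated machine under a clock of exactly `bound` ticks.

    step, is_halted, and config are illustrative stand-ins for M's transition
    function, its halting test, and its initial configuration.
    """
    ticks = 0
    for _ in range(bound):            # one pass per mark on the enumeration tape
        if not is_halted(config):
            config = step(config)     # simulate one step of M
        ticks += 1                    # advance the clock head whether or not M moved
    if not is_halted(config):
        raise RuntimeError("M did not halt within the stated bound")
    return config, ticks
```

A machine that counts down from 3 and one that counts down from 7 both finish after the same number of ticks when run with `bound = 10`.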

Configuration graphs, defined in Section 5.3, are graphs that capture the state of Turing 
machines with potentially unlimited storage capacity. Since all resource bounds are proper, as 
we know from Theorem 8.4.2, all DTMs and NDTMs used for decision problems halt on all 
inputs. Furthermore, NDTMs never give an incorrect answer. Thus, configuration graphs can 
be assumed to be acyclic. 

8.5 Classification of Decision Problems 

In this section we classify decision problems by the resources they consume on deterministic 
and nondeterministic Turing machines. We begin with the definition of complexity classes. 

DEFINITION 8.5.1 Let $r(n) : \mathbb{N} \mapsto \mathbb{N}$ be a proper resource function. Then 
TIME(r(n)) and SPACE(r(n)) are the time and space Turing complexity classes containing 
languages that can be recognized by DTMs that halt on all inputs in time and space r(n), 
respectively, where n is the length of an input. NTIME(r(n)) and NSPACE(r(n)) are the 
nondeterministic time and space Turing complexity classes, respectively, defined for NDTMs 
instead of DTMs. The union of complexity classes is also a complexity class. 

Let k be a positive integer. Then TIME($k^n$) and NSPACE($n^k$) are examples of complexity 
classes. They are the decision problems solvable in deterministic time $k^n$ and nondeterministic 
space $n^k$, respectively, for n the length of the input. Since time and space on a Turing machine 
are measured by the number of steps and number of tape cells, it is straightforward to show 
that time and space for a given Turing machine, deterministic or not, can each be reduced by 
a constant factor by modifying the Turing machine description so that it acts on larger units 
of information. (See Problem 8.8.) Thus, for a constant K > 0 the following classes are the 
same: a) TIME($k^n$) and TIME($K k^n$), b) NTIME($k^n$) and NTIME($K k^n$), c) SPACE($n^k$) 
and SPACE($K n^k$), and d) NSPACE($n^k$) and NSPACE($K n^k$). 

To emphasize that the union of complexity classes is another complexity class, we define 
as unions two of the most important Turing complexity classes, P, the class of deterministic 
polynomial-time decision problems, and NP, the class of nondeterministic polynomial-time 
decision problems. 

DEFINITION 8.5.2 The classes P and NP are sets of decision problems solvable in polynomial time 
on DTMs and NDTMs, respectively; that is, they are defined as follows: 

$$\mathrm{P} = \bigcup_{k \ge 0} \mathrm{TIME}(n^k)$$

$$\mathrm{NP} = \bigcup_{k \ge 0} \mathrm{NTIME}(n^k)$$



Thus, for each decision problem $\mathcal{P}$ in P there is a DTM M and a polynomial p(n) such 
that M halts on each input string of length n in p(n) steps, accepting this string if it is an 
instance w of $\mathcal{P}$ and rejecting it otherwise. 

Also, for each decision problem $\mathcal{P}$ in NP there is an NDTM M and a polynomial p(n) 
such that for each instance w of $\mathcal{P}$, $|w| = n$, there is a choice input of length p(n) 
such that M accepts w in p(n) steps. 

Problems in P are considered feasible problems because they can be decided in time 
polynomial in the length of their input. Even though polynomial functions of large degree 
grow very rapidly in their one parameter, at the present time problems in P are considered 
feasible. Problems that require exponential time are not considered feasible. 

The class NP includes the decision problems associated with many hundreds of important 
searching and optimization problems, such as TRAVELING SALESPERSON described below. 
(See Fig. 8.4.) If P is equal to NP, then these important problems have feasible solutions. If 
not, then there are problems in NP that require superpolynomial time and are therefore largely 
infeasible. Thus, it is very important to have the answer to the question of whether P = NP. 

TRAVELING SALESPERSON 

Instance: An integer k and a set of $n^2$ symmetric integer distances 
$\{d_{i,j} \mid 1 \le i, j \le n\}$ between n cities where $d_{i,j} = d_{j,i}$. 

Answer: "Yes" if there is a tour (an ordering) $\{i_1, i_2, \ldots, i_n\}$ of the cities such that 
the length $l = d_{i_1,i_2} + d_{i_2,i_3} + \cdots + d_{i_n,i_1}$ of the tour satisfies $l \le k$. 

The TRAVELING SALESPERSON problem is in NP because a tour satisfying $l \le k$ can 
be chosen nondeterministically in n steps and the condition $l \le k$ then verified in a 
polynomial number of steps by finding the distances between successive cities on the chosen 
tour in the description of the problem and adding them together. (See Problem 3.24.) Many other 
important problems are in NP, as we see in Section 8.10. While it is unknown whether a 
deterministic polynomial-time algorithm exists for this problem, it can clearly be solved 
deterministically in exponential time by enumerating all tours and choosing the one with smallest 
length. (See Problem 8.9.) 

Figure 8.4 A graph on which the TRAVELING SALESPERSON problem is defined. The heavy 
edges identify a shortest tour. 
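
The two observations above, fast verification of a proposed tour and exhaustive deterministic search, can be sketched as follows. The representation is an assumption made for illustration: cities are numbered from 0, the distances are given as a square list of lists `d`, and the function names are not taken from the text.

```python
from itertools import permutations


def tour_length(d, tour):
    """Sum the distances between successive cities, returning to the start."""
    n = len(tour)
    return sum(d[tour[i]][tour[(i + 1) % n]] for i in range(n))


def verify_tour(d, k, tour):
    """Polynomial-time check that `tour` visits each city once and has length <= k."""
    n = len(d)
    return sorted(tour) == list(range(n)) and tour_length(d, tour) <= k


def decide_by_enumeration(d, k):
    """Exponential-time decision: try every tour that starts at city 0."""
    n = len(d)
    return any(verify_tour(d, k, (0,) + rest) for rest in permutations(range(1, n)))
```

`verify_tour` plays the part of the deterministic check that follows the nondeterministic choice of a tour, while `decide_by_enumeration` examines (n - 1)! orderings.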

The TRAVELING SALESPERSON decision problem is a reduction of the traveling salesperson 
optimization problem, whose goal is to find the shortest tour that visits each city 
once. The output of the optimization problem is an ordering of the cities that has the 
shortest tour. By contrast, the TRAVELING SALESPERSON decision problem reports that there is 
or is not a tour of length k or less. Given an algorithm for the optimization problem, the 
decision problem can be solved by calculating the length of an optimal tour and comparing 
it to the parameter k of the decision problem. Since the latter steps can be done in 
polynomial time, if the optimization algorithm can be done in polynomial time, so can the decision 
problem. On the other hand, given an algorithm for the decision problem, the optimization 
problem can be solved through bisection as follows: a) Since the length of the shortest tour 
is in the interval $[n \min_{i,j} d_{i,j},\, n \max_{i,j} d_{i,j}]$, invoke the decision algorithm 
with k equal to the midpoint of this interval. b) If the instance is a "yes" instance, let k be 
the midpoint of the lower half of the current interval; if not, let it be the midpoint of the 
upper half. c) Repeat the previous step until the interval is reduced to one integer. The 
interval is bisected $O(\log n(\max_{i,j} d_{i,j} - \min_{i,j} d_{i,j}))$ times. Thus, if the 
decision problem can be solved in polynomial time, so can the optimization problem. 
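
The bisection procedure can be sketched as follows. The brute-force `decide` function is only a stand-in for whatever decision algorithm is available, and the sketch makes two assumptions not stated in the text: the diagonal entries of the distance matrix are ignored when taking the minimum and maximum, and there are at least two cities. It returns the length of a shortest tour together with the number of calls made to the decision procedure.

```python
from itertools import permutations


def decide(d, k):
    """Stand-in decision algorithm (brute force): is there a tour of length <= k?"""
    n = len(d)
    tours = ((0,) + rest for rest in permutations(range(1, n)))
    return any(sum(d[t[i]][t[(i + 1) % n]] for i in range(n)) <= k for t in tours)


def shortest_tour_length(d, decide=decide):
    """Find the optimal tour length by bisection on the parameter k."""
    n = len(d)
    entries = [d[i][j] for i in range(n) for j in range(n) if i != j]
    lo, hi = n * min(entries), n * max(entries)   # the optimum lies in [lo, hi]
    calls = 0
    while lo < hi:
        mid = (lo + hi) // 2
        calls += 1
        if decide(d, mid):
            hi = mid          # a tour of length <= mid exists: keep the lower half
        else:
            lo = mid + 1      # no such tour: keep the upper half
    return lo, calls
```

The number of calls grows with the logarithm of the width of the initial interval, so a polynomial-time decision procedure yields a polynomial-time computation of the optimal length.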

Whether P = NP is one of the outstanding problems of computer science. The current 
consensus of complexity theorists is that nondeterminism is such a powerful specification de- 
vice that they are not equal. We return to this topic in Section 8.8. 

8.5.1 Space and Time Hierarchies 

In this section we state without proof the following time and space hierarchy theorems. (See 
[127,128].) These theorems state that if one space (or time) resource bound grows sufficiently 
rapidly relative to another, the set of languages recognized within the first bound is strictly 
larger than the set recognized within the second bound. 

THEOREM 8.5.1 (Time Hierarchy Theorem) If $r(n) \ge n$ is a proper complexity function, 
then TIME(r(n)) is strictly contained in TIME(r(n) log r(n)). 




Let r(n) and s(n) be proper functions. If for all K > 0 there exists an $N_0$ such that 
$s(n) \ge K r(n)$ for $n \ge N_0$, we say that r(n) is little oh of s(n) and write r(n) = o(s(n)). 

THEOREM 8.5.2 (Space Hierarchy Theorem) If r(n) and s(n) are proper complexity func- 
tions and r(n) = o(s(n)), then SPACE(r(n)) is strictly contained in SPACE(s(n)). 

Theorem 8.5.3 states that there is a recursive but not proper resource function r(n) such 
that TIME(r(n)) and TIME($2^{r(n)}$) are the same. That is, for some function r(n) there is a 
gap of at least $2^{r(n)} - r(n)$ in time over which no new decision problems are encountered. 
This is a weakened version of a stronger result in [334] and independently reported by [51]. 

THEOREM 8.5.3 (Gap Theorem) There is a recursive function 
$r(n) : \mathcal{B}^* \mapsto \mathcal{B}^*$ such that TIME(r(n)) = TIME($2^{r(n)}$). 

8.5.2 Time-Bounded Complexity Classes 

As mentioned earlier, decision problems in P are considered to be feasible while the class 
NP includes many interesting problems, such as the TRAVELING SALESPERSON problem, 
whose feasibility is unknown. Two other important complexity classes are the deterministic 
and nondeterministic exponential-time problems. By the remarks on page 336, TRAVELING 
SALESPERSON clearly falls into the latter class. 

DEFINITION 8.5.3 The classes EXPTIME and NEXPTIME consist of those decision problems 
solvable in deterministic and nondeterministic exponential time, respectively, on a Turing machine. 
That is, 

$$\mathrm{EXPTIME} = \bigcup_{k \ge 0} \mathrm{TIME}(2^{n^k})$$

$$\mathrm{NEXPTIME} = \bigcup_{k \ge 0} \mathrm{NTIME}(2^{n^k})$$

We make the following observations concerning containment of these complexity classes. 

THEOREM 8.5.4 The following complexity class containments hold: 

P $\subseteq$ NP $\subseteq$ EXPTIME $\subseteq$ NEXPTIME 

However, P $\subset$ EXPTIME, that is, P is strictly contained in EXPTIME. 

Proof Since languages in P are recognized in polynomial time by a DTM and such machines 
are included among the NDTMs, it follows immediately that P $\subseteq$ NP. By similar reasoning, 
EXPTIME $\subseteq$ NEXPTIME. 

We now show that P is strictly contained in EXPTIME. P $\subseteq$ TIME($2^n$) follows 
because TIME($n^k$) $\subseteq$ TIME($2^n$) for each $k \ge 0$. By the Time Hierarchy Theorem 
(Theorem 8.5.1), we have that TIME($2^n$) $\subset$ TIME($n 2^n$). But TIME($n 2^n$) $\subseteq$ 
EXPTIME. Thus, P is strictly contained in EXPTIME. 

Containment of NP in EXPTIME is deduced from the proof of Theorem 5.2.2 by 
analyzing the time taken by the deterministic simulation of an NDTM. If the NDTM 
executes T steps, the DTM executes $O(k^T)$ steps for some constant k. ■ 

The relationships P $\subseteq$ NP and EXPTIME $\subseteq$ NEXPTIME are examples of a more general 
result, namely, TIME(r(n)) $\subseteq$ NTIME(r(n)), where these two classes of decision problems 
can respectively be solved deterministically and nondeterministically in time r(n), where n 
is the length of the input. This result holds because every instance of length n of a problem 
$\mathcal{P} \in$ TIME(r(n)) is accepted in r(n) steps by some DTM $M_{\mathcal{P}}$ and a DTM is 
also an NDTM. Thus, it is also true that $\mathcal{P} \in$ NTIME(r(n)). 

8.5.3 Space-Bounded Complexity Classes 

Many other important complexity classes are defined by the amount of space used to 
recognize languages and compute functions. We highlight five of them here: the deterministic 
and nondeterministic logarithmic space classes L and NL, the square-logarithmic space 
class $\mathrm{L}^2$, and the deterministic and nondeterministic polynomial-space classes PSPACE 
and NPSPACE. 

DEFINITION 8.5.4 L and NL are the decision problems solvable in logarithmic space on a DTM 
and NDTM, respectively. $\mathrm{L}^2$ is the set of decision problems solvable in space 
$O(\log^2 n)$ on a DTM. PSPACE and NPSPACE are the decision problems solvable in polynomial 
space on a DTM and NDTM, respectively. 

Because L and PSPACE are deterministic complexity classes, they are contained in NL and 
NPSPACE, respectively: that is, L $\subseteq$ NL and PSPACE $\subseteq$ NPSPACE. 

We now strengthen the latter result and show that PSPACE = NPSPACE, which means 
that nondeterminism does not increase the recognition power of Turing machines if they al- 
ready have access to a polynomial amount of storage space. 

The REACHABILITY problem on directed acyclic graphs defined below is used to show this 
result. REACHABILITY is applied to configuration graphs of deterministic and nondetermin- 
istic Turing machines. Configuration graphs are introduced in Section 5.3. 

REACHABILITY 

Instance: A directed graph G = (V, E) and a pair of vertices $u, v \in V$. 

Answer: "Yes" if there is a directed path in G from u to v. 

REACHABILITY can be decided by computing the transitive closure of the adjacency matrix 
of G in parallel. (See Section 6.4.) However, a simple serial RAM program based on depth- 
first search can also solve the reachability problem. Depth-first search (DFS) on an undirected 
graph G visits each edge in the forward direction once. Edges at each vertex are ordered. Each 
time DFS arrives at a vertex it traverses the next unvisited edge. If DFS arrives at a vertex from 
which there are no unvisited edges, it retreats to the previously visited vertex. Thus, after DFS 
visits all the descendants of a vertex, it backs up, eventually returning to the vertex from which 
the search began. 
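
A minimal sketch of depth-first search for reachability follows. It is written for a directed graph stored as a dictionary `adj` that maps each vertex to its ordered list of successors; this representation and the function name are assumptions made for illustration. The explicit stack holds one edge iterator per vertex on the current path, so popping the stack corresponds to retreating to the previously visited vertex.

```python
def reachable(adj, u, v):
    """Return True if a directed path from u to v exists, using depth-first search."""
    if u == v:
        return True
    visited = {u}
    stack = [iter(adj.get(u, ()))]        # edge iterators for the vertices on the current path
    while stack:
        for w in stack[-1]:               # take the next untraversed edge at the top vertex
            if w == v:
                return True
            if w not in visited:
                visited.add(w)
                stack.append(iter(adj.get(w, ())))
                break                     # descend into w
        else:
            stack.pop()                   # no edges left here: retreat
    return False
```

Both `visited` and `stack` can grow to hold every vertex of the graph, so the storage used by this approach can be linear in the size of the graph.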

Since every T-step RAM computation can be simulated by an $O(T^3)$-step DTM computation 
(see Problem 8.6), a cubic-time DTM program based on DFS exists for REACHABILITY. 
Unfortunately, the space needed to execute DFS on both the RAM and the Turing machine can be 
linear in the size of the graph. We give an improved result that allows us to strengthen 
PSPACE $\subseteq$ NPSPACE to PSPACE = NPSPACE. 

Below we show that REACHABILITY can be realized in quadratic logarithmic space. This 
fact is then used to show that NSPACE(r(n)) $\subseteq$ SPACE($r^2(n)$) for 
$r(n) = \Omega(\log n)$. 

THEOREM 8.5.5 (Savitch) REACHABILITY is in SPACE($\log^2 n$). 

Proof As mentioned three paragraphs earlier, the REACHABILITY problem on a graph 
G = (V, E) can be solved with depth-first search. This requires storing data on each vertex 
visited during a search. This data can be as large as O(n), n = |V|. We exhibit an algorithm 
that uses much less space. 

Given an instance of REACHABILITY defined by G = (V, E) and $u, v \in V$, for each 
pair of vertices (a, b) and integer $k \le \lceil \log_2 n \rceil$ we define predicates 
PATH($a, b, 2^k$) whose value is true if there exists a path from a to b in G whose length is at 
most $2^k$ and false otherwise. Since no path has length more than n, the solution to the 
REACHABILITY problem is the value of PATH($u, v, 2^{\lceil \log_2 n \rceil}$). The predicates 
PATH($a, b, 2^0$) are true if either a = b or there is a path of length 1 (an edge) between the 
vertices a and b. Thus, PATH($a, b, 2^0$) can be evaluated directly by consulting the problem 
instance on the input tape. 

The algorithm that computes PATH($u, v, 2^{\lceil \log_2 n \rceil}$) with space $O(\log^2 n)$ 
uses the fact that any path of length at most $2^k$ can be decomposed into two paths of length 
at most $2^{k-1}$. Thus, if PATH($a, b, 2^k$) is true, then there must be some vertex z such that 
PATH($a, z, 2^{k-1}$) and PATH($z, b, 2^{k-1}$) are both true. The truth of PATH($a, b, 2^k$) can 
be established by searching for a z such that PATH($a, z, 2^{k-1}$) is true. Upon finding one, 
we determine the truth of PATH($z, b, 2^{k-1}$). Failing to find such a z, PATH($a, b, 2^k$) is 
declared to be false. Each evaluation of a predicate is done in the same fashion, that is, 
recursively. Because we need evaluate only one of PATH($a, z, 2^{k-1}$) and 
PATH($z, b, 2^{k-1}$) at a time, space can be reused. 

We now describe a deterministic Turing machine with an input tape and two work tapes 
computing PATH($u, v, 2^{\lceil \log_2 n \rceil}$). The input tape contains an instance of 
REACHABILITY, which means it has not only the vertices u and v but also a description of the 
graph G. The first work tape will contain triples of the form (a, b, k), which are called 
activation records. This tape is initialized with the activation record 
$(u, v, \lceil \log_2 n \rceil)$. (See Fig. 8.5.) 

The DTM evaluates the last activation record, (a, b, k), on the first work tape as 
described above. There are three kinds of activation records, complete records of the form 
(a, b, k), initial segments of the form $(a, z, k-1)$, and final segments of the form 
$(z, b, k-1)$. The first work tape is initialized with the complete record 
$(u, v, \lceil \log_2 n \rceil)$. 

An initial segment is created from the current complete record (a, b, k) by selecting a 
vertex z to form the record $(a, z, k-1)$, which becomes the current complete record. If 
it evaluates to true, it can be determined to be an initial or final segment by examining the 
previous record (a, b, k). If it evaluates to false, $(a, z, k-1)$ is erased and another value 
of z, if any, is selected and another initial segment placed on the work tape for evaluation. 
If no other z exists, $(a, z, k-1)$ is erased and the expression PATH($a, b, 2^k$) is declared 
false. If $(a, z, k-1)$ evaluates to true, the final record $(z, b, k-1)$ is created, placed on 
the work tape, and evaluated in the same fashion. As mentioned in the second paragraph of this 
proof, (a, b, 0) is evaluated by consulting the description of the graph on the input tape. The 
second work tape is used for bookkeeping, that is, to enumerate values of z and determine 
whether a segment is initial or final. 

Figure 8.5 A snapshot of the stack used by the REACHABILITY algorithm in which the 
components of an activation record (a, b, k) are distributed over several cells. 

The second work tape uses space O(log n). The first work tape contains at most 
$\lceil \log_2 n \rceil$ activation records. Each activation record (a, b, k) can be stored in 
O(log n) space because each vertex can be specified in O(log n) space and the depth parameter k 
can be specified in O(log k) = O(log log n) space. It follows that the first work tape uses at 
most $O(\log^2 n)$ space. ■ 
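
The recursive evaluation of the PATH predicates can be sketched directly. In this sketch, offered under the assumption that the graph is a dictionary `adj` mapping every vertex to the set of its successors, the call stack of the language plays the role of the first work tape: it holds at most about $\log_2 n$ frames, each containing the equivalent of one activation record.

```python
from math import ceil, log2


def path(adj, a, b, k):
    """PATH(a, b, 2^k): is there a path from a to b of length at most 2**k?"""
    if k == 0:
        return a == b or b in adj[a]      # consult the graph description directly
    # Try each midpoint z; the two halves are evaluated one at a time, so the
    # storage used for the first half is released before the second begins.
    return any(path(adj, a, z, k - 1) and path(adj, z, b, k - 1) for z in adj)


def reachability(adj, u, v):
    """Decide REACHABILITY by evaluating PATH(u, v, 2^ceil(log2 n))."""
    n = len(adj)
    return path(adj, u, v, ceil(log2(n)) if n > 1 else 0)
```

The saving in space is paid for in time: each level of the recursion tries all n midpoints, so the running time of this sketch is far from polynomial.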

The following general result, which is a corollary of Savitch's theorem, demonstrates that 
nondeterminism does not enlarge the space complexity classes if they are defined by space 
bounds that are at least logarithmic. In particular, it implies that PSPACE = NPSPACE. 

COROLLARY 8.5.1 Let r(n) be a proper Turing computable function 
$r : \mathbb{N} \mapsto \mathbb{N}$ satisfying $r(n) = \Omega(\log n)$. Then NSPACE(r(n)) 
$\subseteq$ SPACE($r^2(n)$). 

Proof Let $M_{ND}$ be an NDTM with input and output tapes and s work tapes. Let it 
recognize a language $L \in$ NSPACE(r(n)). For each input string w, we generate a configuration 
graph $G(M_{ND}, w)$ of $M_{ND}$. (See Fig. 8.6.) We use this graph to determine whether or not 
$w \in L$. $M_{ND}$ has at most |Q| states, each tape cell can have at most c values (there are 
$c^{(s+2)r(n)}$ configurations for the s + 2 tapes), the s work tape heads and the output tape 
head can assume values in the range $1 \le h_j \le r(n)$, and the input head $h_{s+1}$ can assume 
one of n positions (there are $n\, r(n)^{s+1}$ configurations for the tape heads). It follows that 
$M_{ND}$ has at most $|Q|\, c^{(s+2)r(n)} (n\, r(n)^{s+1}) \le k^{\log n + r(n)}$ configurations 
for some constant k. $G(M_{ND}, w)$ has the same number of vertices as there are configurations 
and a number of edges at most the square of its number of vertices. 

Let $L \in$ NSPACE(r(n)) be recognized by an NDTM $M_{ND}$. We describe a deterministic 
$r^2(n)$-space Turing machine $M_D$ recognizing L. For an input string w of length n, 
this machine solves the REACHABILITY problem on the configuration graph $G(M_{ND}, w)$ 
of $M_{ND}$ described above. However, instead of placing on the input tape the entire 
configuration graph, we place the input string w and the description of $M_{ND}$. We keep 
configurations on the work tape as part of activation records (they describe vertices of 
$G(M_{ND}, w)$). Each of the vertices (configurations) adjacent to a particular vertex can be 
deduced from the description of $M_{ND}$. 

Figure 8.6 The acyclic configuration graph $G(M_{ND}, w)$ of a nondeterministic Turing machine 
$M_{ND}$ on input w has one vertex for each configuration of $M_{ND}$. Here heavy edges identify 
the nondeterministic choices associated with a configuration. 

Since the number of configurations of $M_{ND}$ is $N = O(k^{\log n + r(n)})$, each 
configuration or activation record can be stored as a string of length O(r(n)). 

From Theorem 8.5.5, the reachability in $G(M_{ND}, w)$ of the final configuration from 
the initial one can be determined in space $O(\log^2 N)$. But $N = O(k^{\log n + r(n)})$, from 
which it follows that NSPACE(r(n)) $\subseteq$ SPACE($r^2(n)$). ■ 
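
The last step of this proof can be spelled out. The derivation below is a sketch in which K denotes the constant hidden in the O-notation for N (a name introduced here for illustration); it uses only the bound on N and the assumption $r(n) = \Omega(\log n)$.

```latex
\begin{align*}
\log_2 N &\le \log_2 K + (\log_2 n + r(n)) \log_2 k \\
         &= O(\log n + r(n)) = O(r(n)) \quad \text{since } r(n) = \Omega(\log n), \\
\log_2^2 N &= O(r^2(n)).
\end{align*}
```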

The classes NL, $\mathrm{L}^2$, and PSPACE are defined as unions of deterministic and 
nondeterministic space-bounded complexity classes. Thus, it follows from this corollary that 
NL $\subseteq$ $\mathrm{L}^2$ $\subseteq$ PSPACE. However, because of the space hierarchy theorem 
(Theorem 8.5.2), it follows that $\mathrm{L}^2$ is contained in but not equal to PSPACE, denoted 
$\mathrm{L}^2 \subset$ PSPACE. 

8.5.4 Relations Between Time- and Space-Bounded Classes 

In this section we establish a number of complexity class containment results involving both 
space- and time-bounded classes. We begin by proving that the nondeterministic O(r(n))-space 
class is contained within the deterministic $O(k^{r(n)})$-time class. This implies that 
NL $\subseteq$ P and NPSPACE $\subseteq$ EXPTIME. 

THEOREM 8.5.6 The classes NSPACE(r(n)) and TIME(r(n)) of decision problems solvable in 
nondeterministic space and deterministic time r(n), respectively, satisfy the following relation for 
some constant k > 0: 

$$\mathrm{NSPACE}(r(n)) \subseteq \mathrm{TIME}(k^{\log n + r(n)})$$

Proof Let $M_{ND}$ accept a language $L \in$ NSPACE(r(n)) and let $G(M_{ND}, w)$ be the 
configuration graph for $M_{ND}$ on input w. To determine if w is accepted by $M_{ND}$ and 
therefore in L, it suffices to determine if there is a path in $G(M_{ND}, w)$ from the initial 
configuration of $M_{ND}$ to the final configuration. This is the REACHABILITY problem, 
which, as stated in the proof of Theorem 8.5.5, can be solved by a DTM in time polynomial 
in the length of the input. When this algorithm needs to determine the descendants of a 
vertex in $G(M_{ND}, w)$, it consults the definition of $M_{ND}$ to determine the configurations 
reachable from the current configuration. It follows that membership of w in L can be 
determined in time $O(k^{\log n + r(n)})$ for some k > 1 or that L is in 
TIME($k^{\log n + r(n)}$). ■ 

COROLLARY 8.5.2 NL $\subseteq$ P and NPSPACE $\subseteq$ EXPTIME 
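
A short calculation, offered as a sketch, shows how the corollary follows from Theorem 8.5.6. The constants a and c are introduced here for illustration, with k > 1 the constant of the theorem.

```latex
\begin{align*}
r(n) \le a \log_2 n &\;\Rightarrow\; k^{\log_2 n + r(n)} \le k^{(1+a)\log_2 n} = n^{(1+a)\log_2 k}, \\
r(n) \le n^c &\;\Rightarrow\; k^{\log_2 n + r(n)} \le k^{2 n^c} = 2^{(2\log_2 k)\, n^c}
  \quad (n \ge 1,\ c \ge 1).
\end{align*}
```

The first bound is a polynomial in n, which gives the containment for NL. The second is at most $2^{n^{c+1}}$ for sufficiently large n, which gives the containment for NPSPACE.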

Later we explore the polynomial-time problems by exhibiting other important complexity 
classes that reside inside P. (See Section 8.15.) We now show containment of the nondeter- 
ministic time complexity classes in deterministic space classes. 

THEOREM 8.5.7 The following containment holds: 

NTIME(r(n)) $\subseteq$ SPACE(r(n)) 

Proof We use the construction of Theorem 5.2.2. Let L be a language in NTIME(r(n)). 
We note that the choice string on the enumeration tape converts the nondeterministic 
recognition of L into deterministic recognition. Since L is recognized in time r(n) for some 
accepting computation, the deterministic enumeration runs in time r(n) for each choice 
string. Thus, O(r(n)) cells are used on the work and enumeration tapes in this deterministic 
simulation and L is in SPACE(r(n)). ■ 

Figure 8.7 The relationships among complexity classes derived in this section. Containment is 
indicated by arrows. 

An immediate corollary to this theorem is that NP $\subseteq$ PSPACE. This implies that 
P $\subseteq$ EXPTIME. However, as mentioned above, P is strictly contained within EXPTIME. 
Combining these results, we have the following complexity class inclusions: 

L $\subseteq$ NL $\subseteq$ P $\subseteq$ NP $\subseteq$ PSPACE $\subseteq$ EXPTIME $\subseteq$ NEXPTIME 

where PSPACE = NPSPACE. We also have $\mathrm{L}^2 \subset$ PSPACE and P $\subset$ EXPTIME, which 
follow from the space and time hierarchy theorems. These inclusions and those derived below 
are shown in Fig. 8.7. 

In Section 8.6 we develop refinements of this partial ordering of complexity classes by using 
the complements of complexity classes. 

We now digress slightly to discuss space-bounded functions. 

8.5.5 Space-Bounded Functions 

We digress briefly to specialize Theorem 8.5.6 to log-space computations, not just log-space 
language recognition. As the following demonstrates, log-space computable functions are com- 
putable in polynomial time. 

THEOREM 8.5.8 Let M be a DTM that halts on all inputs using space O(log n) to process inputs 
of length n. Then M executes a polynomial number of steps. 

Proof In the proof of Corollary 8.5.1 the number of configurations of a Turing machine M 
with input and output tapes and s work tapes is counted. We repeat this analysis. Let r(n) 
be the maximum number of tape cells used, let c be the maximal size of a tape alphabet, and 
let |Q| be the number of states of M. Then, M can be in one of at most 
$\chi \le |Q|\, c^{(s+2)r(n)} (n\, r(n)^{s+1}) = O(k^{\log n + r(n)})$ configurations 
for some k > 1. Since M always halts, by the pigeonhole principle, it passes through at 
most $\chi$ configurations in at most $\chi$ steps. Because r(n) = O(log n), $\chi = O(n^d)$ for 
some integer d. Thus, M executes a polynomial number of steps. ■ 

8.6 Complements of Complexity Classes 

As seen in Section 4.6, the regular languages are closed under complementation. However, we 
have also seen in Section 4.13 that the context-free languages are not closed under comple- 
mentation. Thus, complementation is a way to develop an understanding of the properties of 
a class of languages. In this section we show that the nondeterministic space classes are closed 
under complements. The complements of languages and decision problems were defined at 
the beginning of this chapter. 

Consider REACHABILITY. Its complement $\overline{\text{REACHABILITY}}$ is the set of directed 
graphs G = (V, E) and pairs of vertices $u, v \in V$ such that there is no directed path from u 
to v. It follows that the union of these two problems is not the entire set $\mathcal{B}^*$ of 
strings but the set of all instances consisting of a directed graph G = (V, E) and a pair of 
vertices $u, v \in V$. Membership in this set is easily checked by a DTM. It must only verify 
that the string describing a putative graph is in the correct format and that the 
representations for u and v are among the vertices of this graph. 

Given a complexity class, it is natural to define the complement of the class. 

DEFINITION 8.6.1 The complement of a complexity class of decision problems C, denoted 
coC, is the set of decision problems that are complements of decision problems in C. 

Our first result follows from the definition of the recognition of languages by DTMs. 

THEOREM 8.6.1 If C is a deterministic time or space complexity class, then coC = C. 

Proof Every $L \in$ C is recognized by a DTM M that halts within the resource bound 
of C for every string, whether in L or $\overline{L}$, the complement of L. Create 
$\overline{M}$ from M by complementing the accept/reject status of states of M's control unit. 
Thus, $\overline{L}$, which by definition is in coC, is also in C. That is, coC $\subseteq$ C. 
Similarly, C $\subseteq$ coC. Thus, coC = C. ■ 
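
The construction in this proof amounts to swapping the answers of a procedure that always halts. The sketch below uses an ordinary Python function as a stand-in for such a DTM; `even_ones` is a hypothetical example language chosen only for illustration.

```python
def complement(decider):
    """Swap the accept/reject status of a decider that halts on every input."""
    return lambda w: not decider(w)


def even_ones(w):
    """Example decider: accept binary strings containing an even number of 1s."""
    return w.count("1") % 2 == 0


odd_ones = complement(even_ones)
```

The complemented procedure performs the same computation as the original and so uses the same time and space. The construction relies on the decider halting on every input, which is why it does not carry over directly to nondeterministic acceptance.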

In particular, this result says that the class P is closed under complements. That is, if the 
"yes" instances of a decision problem can be answered in deterministic polynomial time, then 
so can the "no" instances. 

We use the above theorem and Theorem 5.7.6 to give another proof that there are problems 
that are not in P. 

COROLLARY 8.6.1 There are languages not in P, that is, languages that cannot be recognized 
deterministically in polynomial time. 

Proof Since every language in P is recursive and $\mathcal{L}_1$ defined in Section 5.7.2 is not 
recursive, it follows that $\mathcal{L}_1$ is not in P. ■ 

We now show that all nondeterministic space classes with a sufficiently large space bound 
are also closed under complements. This leaves open the question whether the nondeterministic 
time classes are closed under complement. As we shall see, this is intimately related to the 
question of whether P = NP. 

As stated in Definition 5.2.1, for no choices of moves is an NDTM allowed to produce an 
answer for which it is not designed. In particular, when computing a function it is not allowed 
to give a false answer for any set of nondeterministic choices. 

THEOREM 8.6.2 (Immerman-Szelepcsényi) Given a graph G = (V, E) and a vertex v, the 
number of vertices reachable from v can be computed by an NDTM in space O(log n), n = |V|. 

Proof Let V = {1, 2, . . . , n}. Any node reachable from a vertex v must be reachable via a 
path of length (number of edges) of at most n − 1, n = |V|. Let R(k, u) be the number 
of vertices of G reachable from u by paths of length k or less. The goal is to compute 
R(n − 1, u). A deterministic program for this purpose could be based on the predicate 
PATH(u, v, k) that has value 1 if there is a path of length k or less from vertex u to vertex 
v and 0 otherwise and the predicate ADJACENT-OR-IDENTICAL(x, v) that has value 1 if 
x = v or there is an edge in G from x to v and 0 otherwise. (See Fig. 8.8.) If we let the 
vertices be associated with the integers in the interval [1, . . . , n], then R(n − 1, u) can be 
evaluated as follows: 



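$$\begin{aligned}
R(n-1, u) &= \sum_{1 \le v \le n} \mathrm{PATH}(u, v, n-1) \\
          &= \sum_{1 \le v \le n} \; \bigvee_{1 \le x \le n} \mathrm{PATH}(u, x, n-2)\, \text{ADJACENT-OR-IDENTICAL}(x, v)
\end{aligned}$$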



When this description of R(n − 1, u) is converted to a program, the amount of storage 
needed grows more rapidly than O(log n). However, if the inner use of PATH(u, x, n − 2) 
is replaced by the nonrecursive and nondeterministic test EXISTS-PATH-FROM-u-TO-v-≤-LENGTH 
of Fig. 8.9 for a path from u to x of length at most n − 2, then the space can be kept to 
O(log n). This test nondeterministically guesses paths but verifies deterministically that all 
paths have been explored. 
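
The counting idea can be sketched deterministically. The sketch below is not the nondeterministic program of Fig. 8.9 and does not run in logarithmic space: it replaces each nondeterministic guess by an exhaustive search over all guess sequences (the `exists_path` stand-in), so that the structure of the count can be checked. The graph representation, a dictionary `adj` mapping every vertex to the set of its successors, and all function names are assumptions made for illustration.

```python
from itertools import product


def adjacent_or_identical(adj, x, v):
    return x == v or v in adj[x]


def exists_path(adj, u, v, dist):
    """Stand-in for the nondeterministic test: try every sequence of `dist` guesses."""
    for guesses in product(list(adj), repeat=dist):
        node = u
        for g in guesses:
            if not adjacent_or_identical(adj, node, g):
                break                     # this guess sequence is not a path
            node = g
        else:
            if node == v:
                return True
    return False


def count_reachable(adj, u):
    """Compute R(n - 1, u) by inductive counting over increasing distances."""
    n = len(adj)
    prev_num_dist = 1                     # R(0, u): only u itself
    for dist in range(1, n):
        num_dist = 0
        for last_node in adj:
            num_seen, reply = 0, False
            for x in adj:                 # candidate next-to-last vertex
                if exists_path(adj, u, x, dist - 1):
                    num_seen += 1
                    if adjacent_or_identical(adj, x, last_node):
                        reply = True
            # The nondeterministic program fails at this point if its guesses
            # missed a vertex; the exhaustive stand-in never misses one.
            assert num_seen == prev_num_dist
            if reply:
                num_dist += 1
        prev_num_dist = num_dist
    return prev_num_dist
```

The comparison of `num_seen` with `prev_num_dist` is the step that lets a nondeterministic machine certify a negative answer: if every vertex within distance dist − 1 has been accounted for and none of them leads to `last_node`, then `last_node` is not within distance dist.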

The procedure COUNTING-REACHABILITY of Fig. 8.9 is a nondeterministic program 
computing R(n − 1, u). It uses the procedure #-VERTICES-AT-≤-DIST-FROM-u 
to compute the number of vertices at distance dist or less from u in order of increasing 
values of dist. (It computes this number correctly or fails.) This procedure has prev_num_dist 
as a parameter, which is the number of vertices at distance dist − 1 or less. It passes this 




(a) (b) 

Figure 8.8 Paths explored by the REACHABILITY algorithm. Case (a) applies when x and v are 
different and (b) when they are the same. 
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counting-reachability(u) 

{R(k, u) = number of vertices at distance < k from u in G = (V, E)} 
prevjnurri-dist := 1; {num_dist = R(0,u)} 
for dist := 1 to n — 1 

num_dist := #-VERTICES-AT-<-DIST-FROM-w(dis£, u,prev jnum_dist) 
prevjnurri-dist := nurrudist 
{num_dist = R(dist,u)} 
retum(num_dist) 

#-VERTlCES-AT-<-DlSTANCE-EROM-u(dist, u,prev_num_dist) 
{Returns R(dist, u) given prev-numAist = R{dist — 1, u) or fails} 
num_nodes := 
for lastjnode := 1 to n 

if IS-NODE-AT-<-DIST-FROM-M(dist, u, lastjnode, prev _num_dist) then 
numjnodes := numjnodes + 1 
return (numjnodes) 

lS-NODE-AT-<-DlST-FROM-u(dist,u,last_node,prev_num_dist) 

{numjnode - number of vertices at distance < dist from u found so far} 

numjnode := 0; 

reply := false 

for next -to last -node := 1 to n 

if EXlSTS-PATH-FROM-u-TO-u-<-LENGTH(u, next-toJast-node, dist - 1) then 
numjnode := numjnode + 1 {count number of next-to-last nodes or fail} 
if ADJACENT-OR-IDENTICAL(nexi_foJasi_nocte, lastjnode) then 
reply := true 
if numjnode < prevjnum_dist then 

fail 
else return (reply) 

EXISTS-PATH-FROM-u-TO-w-<-LENGTH(u, v, dist) 

{nondeterministically choose at most dist vertices, fail if they don't form a path} 

nodeA := u 

for count := 1 to dist 

node J. := NONDETERMINISTIC-GUESS([l, .., n}) 

if not ADJACENT-OR-IDENTICAL(node_l, node J.) then 
fail 

else nodeA := nodeJ2 
if node _2 = v then 

return(true) 
else 

return(false) 

Figure 8.9 A nondeterministic program counting vertices reachable from u. Comments are 
enclosed in braces {, }. 
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value to the procedure IS-NODE-AT-<-DIST-FROM-u, which examines and counts all pos- 
sible next _to _last_nodes reachable from u. #-VERTICES-AT-<-DISTANCE-FROM-w ei- 
ther fails to find all possible vertices at distance dist — 1, in which case it fails, or finds all 
such vertices. Thus, it nondeterministically verifies that all possible paths from u have been 
explored. IS-NODE-AT-<-DIST-FROM-u uses the procedure EXISTS-PATH-FROM-w-TO- 
u-<-LENGTH that either correctly verifies that a path of length dist — 1 exists from u to 
next _to last _node or fails. In turn, EXISTS-PATH-FROM-u-TO-w-<-LENGTH uses the 
command NONDETERMINISTIC-GUESS([l, .., n]) to nondeterministically choose nodes 
on a path from uto v. 

Since this program is not recursive, it uses a fixed number of variables. Because these
variables assume values in the range [1, 2, 3, ..., n], it follows that space O(log n) suffices
to implement it on an NDTM. ■
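
The count-or-fail idea in this proof can be explored on a small graph with the following Python sketch. It is not a log-space program and it is not the program of Fig. 8.9: it enumerates, for each candidate vertex, every set of next-to-last vertices that a branch of the nondeterministic test might discover, and checks that all branches surviving the count test give the same reply. The dictionary encoding of the graph and the names reach_within, branch_outcomes, and counting_reachability are assumptions made for this illustration.

```python
from itertools import combinations

def reach_within(adj, u, k):
    """Vertices reachable from u by paths of length <= k. Used only to
    enumerate what a nondeterministic branch could possibly discover."""
    seen, frontier = {u}, {u}
    for _ in range(k):
        frontier = {w for x in frontier for w in adj[x]} - seen
        seen |= frontier
    return seen

def branch_outcomes(adj, u, dist, last, prev_count):
    """Outcomes of all branches of the membership test for `last`.
    None stands for 'fail'; True/False is the reply of a surviving branch."""
    candidates = sorted(reach_within(adj, u, dist - 1))
    outcomes = set()
    for size in range(len(candidates) + 1):
        for found in combinations(candidates, size):
            if len(found) < prev_count:
                outcomes.add(None)  # the branch missed a vertex, so it fails
            else:
                outcomes.add(any(x == last or last in adj[x] for x in found))
    return outcomes

def counting_reachability(adj, u):
    """Return R(n - 1, u) by inductive counting over dist = 1, ..., n - 1."""
    prev = 1  # R(0, u): only u itself
    for dist in range(1, len(adj)):
        num = 0
        for last in adj:
            surviving = branch_outcomes(adj, u, dist, last, prev) - {None}
            assert len(surviving) == 1  # all surviving branches agree
            num += surviving.pop()
        prev = num
    return prev

# Hypothetical example graph: edges 1 -> 2, 2 -> 3, 4 -> 1.
example = {1: {2}, 2: {3}, 3: set(), 4: {1}}
print(counting_reachability(example, 1))
```

The internal assertion is the point of the sketch: a branch that undercounts is discarded, so every branch that returns a value returns the correct one.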

We now extend this result to nondeterministic space computations. 

COROLLARY 8.6.2 If r(n) = Ω(log n) is proper, NSPACE(r(n)) = coNSPACE(r(n)).

Proof Let L ∈ NSPACE(r(n)) be decided by an r(n)-space bounded NDTM $M_L$. We
show that the complement $\bar{L}$ of L can be decided by a nondeterministic r(n)-space bounded
Turing machine $M_{\bar{L}}$, stopping on all inputs. We modify slightly the program of Fig. 8.9 for
this purpose. The graph G is the configuration graph of $M_L$. Its initial state is determined
by the string w that is initially written on $M_L$'s input tape. To determine adjacency between
two vertices in the configuration graph, computations of $M_L$ are simulated on one of
$M_{\bar{L}}$'s work tapes.

$M_{\bar{L}}$ computes a slightly modified version of COUNTING-REACHABILITY. First, if the
procedure IS-NODE-AT-≤-DIST-FROM-u returns true for a vertex that is a
halting accepting configuration of $M_L$, then $M_{\bar{L}}$ halts and rejects the string. If the
procedure COUNTING-REACHABILITY completes successfully without rejecting the string, then
$M_{\bar{L}}$ halts and accepts the input string because every possible computation of $M_L$ on the
input string has been examined and none of them is accepting. This computation is
nondeterministic.

The space used by $M_{\bar{L}}$ is the space needed for COUNTING-REACHABILITY, which
is O(log N), where N is the number of vertices in the configuration graph of
$M_L$, plus the space for a simulation of $M_L$, which is O(r(n)). Since $N = O(k^{\log n + r(n)})$
(see the proof of Theorem 8.5.6), the total space for this computation is O(log n + r(n)),
which is O(r(n)) if r(n) = Ω(log n). By definition $\bar{L}$ ∈ coNSPACE(r(n)). From the
above construction $\bar{L}$ ∈ NSPACE(r(n)). Thus, coNSPACE(r(n)) ⊆ NSPACE(r(n)).

By similar reasoning, if L ∈ NSPACE(r(n)), then its complement $\bar{L}$ is also in
NSPACE(r(n)), so L is the complement of a language in NSPACE(r(n)) and hence
L ∈ coNSPACE(r(n)). This implies that NSPACE(r(n)) ⊆ coNSPACE(r(n)); that is, they
are equal. ■

The lowest class in the space hierarchy that is known to be closed under complements is 
the class NL; that is, NL = coNL. This result is used in Section 8.11 to show that the problem 
2-SAT, a specialization of the NP-complete problem 3-SAT, is in P. 

From Theorem 8.6.1 we know that all deterministic time and space complexity classes are
closed under complements. From Corollary 8.6.2 we also know that all nondeterministic space
complexity classes with space Ω(log n) are closed under complements. However, we do not
yet know whether the nondeterministic time complexity classes are closed under complements.
This important question is related to the question whether P = NP, because if NP ≠ coNP,
then P ≠ NP because P is closed under complements but NP is not.

8.6.1 The Complement of NP 

The class coNP is the class of decision problems whose complements are in NP. That is,
coNP consists of the languages of "No" instances of problems in NP. The decision problem
VALIDITY defined below is an example of a problem in coNP. In fact, it is log-space complete
for coNP. (See Problem 8.10.) VALIDITY identifies SOPEs (sum-of-products expansions, defined
in Section 2.3) that have value 1 for every assignment to their variables.

VALIDITY

Instance: A set of literals $X = \{x_1, \bar{x}_1, x_2, \bar{x}_2, \ldots, x_n, \bar{x}_n\}$ and a sequence of
products $P = (p_1, p_2, \ldots, p_m)$, where each product $p_i$ is a subset of X.

Answer: "Yes" if for all assignments of Boolean values to variables in $\{x_1, x_2, \ldots, x_n\}$ every
literal in at least one product has value 1.

Given a language L in NP, a string in L has a certificate for its membership in L consisting
of the set of choices that cause its recognizing Turing machine to accept it. For example, a
certificate for SATISFIABILITY is a set of values for its variables satisfying at least one literal
in each sum. For an instance of a problem in coNP, a disqualification is a certificate for the
complement of the instance. An instance in coVALIDITY is disqualified by an assignment that
causes all products to have value 0. Thus, each "No" instance of VALIDITY is disqualified by
an assignment that prevents the expression from being valid. (See Problem 8.11.)
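
To make the notion of a disqualification concrete, the following Python sketch searches all assignments for one that gives every product the value 0. It is a brute-force illustration that takes time exponential in the number of variables, not an efficient procedure. The encoding of a literal as a pair (variable index, polarity) and the function names are assumptions introduced here.

```python
from itertools import product

def product_value(p, assignment):
    """A product has value 1 exactly when every one of its literals has value 1."""
    return all(assignment[var] == polarity for var, polarity in p)

def find_disqualification(num_vars, products):
    """Return an assignment making all products 0, or None if the SOPE is valid."""
    for values in product([False, True], repeat=num_vars):
        assignment = dict(enumerate(values, start=1))
        if not any(product_value(p, assignment) for p in products):
            return assignment  # certificate that the instance is a "No" instance
    return None

# x1 OR (NOT x1) is valid, so it has no disqualification.
valid = [{(1, True)}, {(1, False)}]
# (x1 AND x2) OR (NOT x1) is not valid: x1 = 1, x2 = 0 gives value 0.
not_valid = [{(1, True), (2, True)}, {(1, False)}]
print(find_disqualification(1, valid), find_disqualification(2, not_valid))
```

Checking a proposed disqualification needs only one pass over the products, which is why the complement of VALIDITY is in NP.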

As mentioned just before the start of this section, if NP ≠ coNP, then P ≠ NP because P
is closed under complements. Because we know of no way to establish NP ≠ coNP, we try to
identify a problem that is in NP but is not known to be in P. A problem that is in NP and coNP
simultaneously (the class NP ∩ coNP) is a possible candidate for a problem that is in NP but
not P, which would show that P ≠ NP. We show that PRIMALITY is in NP ∩ coNP. (It is
straightforward to show that P ⊆ NP ∩ coNP. See Problem 8.12.)

PRIMALITY 

Instance: An integer n written in binary notation. 

Answer: "Yes" if n is a prime. 

A disqualification for PRIMALITY is an integer other than 1 and n that is a factor of n. Thus,
the complement of PRIMALITY is in NP, so PRIMALITY is in coNP. We now show that PRIMALITY
is also in NP, that is, that it is in NP ∩ coNP. To prove the desired result we need the
following result from number theory, which we do not prove (see [235, p. 222] for a proof).

THEOREM 8.6.3 An integer p > 2 is prime if and only if there is an integer 1 < r < p such that
$r^{p-1} = 1 \bmod p$ and for all prime divisors q of p - 1, $r^{(p-1)/q} \ne 1 \bmod p$.

As a consequence, to give evidence of primality of an integer p > 2, we need only provide
an integer r, 1 < r < p, and the prime divisors $\{q_1, \ldots, q_k\}$ other than 1 of p - 1 and then
show that $r^{p-1} = 1 \bmod p$ and $r^{(p-1)/q} \ne 1 \bmod p$ for $q \in \{q_1, \ldots, q_k\}$. By the theorem,
such integers exist if and only if p is prime. In turn, we must give evidence that the integers
$\{q_1, \ldots, q_k\}$ are prime divisors of p - 1, which requires showing that they divide p - 1 and
are prime. We must also show that k is small and that the recursive check of the primes does
not grow exponentially. Evidence of the primality of the divisors can be given in the same way,
that is, by exhibiting an integer $r_j$ for each prime as well as the prime divisors of $q_j - 1$ for
each prime $q_j$. We must then show that all of this evidence can be given succinctly and verified
deterministically in time polynomial in the length n of p.

THEOREM 8.6.4 PRIMALITY is in NP ∩ coNP.

Proof We give an inductive proof that PRIMALITY is in NP. For a prime p we give its
evidence E(p) as $(p; r, E(q_1), \ldots, E(q_k))$, where $E(q_j)$ is evidence for the prime $q_j$. We
let the evidence for the base case p = 2 be E(2) = (2). Then, E(3) = (3; 2, (2)) because
r = 2 works for this case and 2 is the only prime divisor of 3 - 1, and (2) is the evidence for
it. Also, E(5) = (5; 3, (2)). The length |E(p)| of the evidence E(p) on p is the number
of parentheses, commas and bits in integers forming part of the evidence.

We show by induction that |E(p)| is at most $4 \log_2^2 p$. The base case satisfies the
hypothesis because |E(2)| = 4.

Because the prime divisors $\{q_1, \ldots, q_k\}$ satisfy $q_i \ge 2$ and $q_1 q_2 \cdots q_k \le p - 1$, it follows
that $k \le \lfloor \log_2 p \rfloor \le n$. Also, since p > 2 is prime, it is odd and p - 1 is divisible by 2.
Thus, the first prime divisor of p - 1 is 2.

Let $E(p) = (p; r, E(2), E(q_2), \ldots, E(q_k))$. Let the inductive hypothesis be that
$|E(p)| \le 4 \log_2^2 p$. Let $n_j = \log_2 q_j$. From the definition of E(p) we have that |E(p)|
satisfies the following inequality because at most n bits are needed for each of p and r, there
are k - 1 ≤ n - 1 commas and three other punctuation marks, and |E(2)| = 4.



$$|E(p)| \le 3n + 6 + 4 \sum_{2 \le j \le k} n_j^2$$



Since the $q_j$ are the prime divisors of p - 1 and some primes may be repeated in p - 1,
their product (which includes $q_1 = 2$) is at most p - 1. It follows that
$\sum_{2 \le j \le k} n_j = \log_2 \prod_{2 \le j \le k} q_j \le \log_2((p-1)/2)$. Since the sum of the squares of
$n_j$ is less than or equal to the square of the sum of $n_j$, it follows that the sum in the above
expression is at most $(\log_2 p - 1)^2 \le (n-1)^2$. But $3n + 6 + 4(n-1)^2 = 4n^2 - 5n + 10 \le 4n^2$
when n ≥ 2. Thus, the description of a certificate for the primality of p is polynomial in the
length n of p.

We now show by induction that a prime p can be verified in $O(n^5)$ steps on a RAM.
Assume that the divisors $q_1, \ldots, q_k$ for p - 1 have been verified. To verify p, we compute
$r^{p-1} \bmod p$ from r and p as well as $r^{(p-1)/q} \bmod p$ for each of the prime divisors q of
p - 1 and compare the results with 1. The integers (p - 1)/q can be computed through
subtraction of n-bit numbers in $O(n^2)$ steps on a RAM. To raise r to an exponent e,
represent e as a binary number. For example, if e = 7, write it as $e = 2^2 + 2^1 + 2^0$. If $2^t$
is the largest such power of 2, $t \le \log_2(p-1) < n$. Compute $r^{2^j} \bmod p$ by squaring
r j times, each time reducing it by p through division. Since each squaring/reduction step
takes $O(n^2)$ RAM steps, at most $O(jn^2)$ RAM steps are required to compute $r^{2^j}$. Since
this may be done for 2 ≤ j ≤ t and $\sum_{2 \le j \le t} j = O(t^2)$, at most $O(n^3)$ RAM steps suffice
to compute one of $r^{p-1} \bmod p$ or $r^{(p-1)/q} \bmod p$ for a prime divisor q. Since there are at
most n of these quantities to compute, $O(n^4)$ RAM steps suffice to compute them.

To complete the verification of the prime p, we also need to verify the divisors $q_1, \ldots, q_k$
of p - 1. We take as our inductive hypothesis that an arbitrary prime q of n bits can be
verified in $O(n^5)$ steps. Since the sum of the number of bits in $q_2, \ldots, q_k$ is at most
$\log_2((p-1)/2) \le n - 1$ and the sum of the kth powers is no more than the kth power of the
sum, it follows that $O(n^5)$ RAM steps suffice to verify p. Since a polynomial number of RAM
steps can be executed in a polynomial number of Turing machine steps, PRIMALITY is in NP. ■
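
The verification described in this proof can be sketched in Python as follows. The evidence E(p) is encoded here as a nested tuple (p, r, [sub-evidence, ...]) with E(2) written as (2,); this encoding and the names mod_pow and verify_evidence are assumptions made for illustration. The exponentiation follows the repeated-squaring idea of the proof, and the check that the listed primes account for all of p - 1 is done by dividing them out.

```python
def mod_pow(r, e, p):
    """Compute r**e mod p by repeated squaring over the binary digits of e."""
    result, square = 1, r % p
    while e > 0:
        if e & 1:  # this power of 2 occurs in the binary expansion of e
            result = (result * square) % p
        square = (square * square) % p
        e >>= 1
    return result

def verify_evidence(evidence):
    """Check evidence of primality in the spirit of Theorem 8.6.3."""
    p = evidence[0]
    if p == 2:
        return len(evidence) == 1  # base case E(2) = (2)
    if len(evidence) != 3 or p < 3:
        return False
    _, r, sub_evidence = evidence
    if not 1 < r < p or mod_pow(r, p - 1, p) != 1:
        return False
    remaining = p - 1
    for sub in sub_evidence:
        q = sub[0]
        if q < 2 or (p - 1) % q != 0 or not verify_evidence(sub):
            return False  # q must be a verified prime divisor of p - 1
        if mod_pow(r, (p - 1) // q, p) == 1:
            return False
        while remaining % q == 0:
            remaining //= q
    return remaining == 1  # the listed primes are all the prime divisors of p - 1

E2 = (2,)
E3 = (3, 2, [E2])
E7 = (7, 3, [E2, E3])
print(verify_evidence(E7))
```

Omitting a prime divisor of p - 1 from the evidence, or choosing an r that fails one of the congruence tests, makes the check return False.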

Since NP ∩ coNP ⊆ NP and NP ∩ coNP ⊆ coNP as well as NP ⊆ NP ∪ coNP and
coNP ⊆ NP ∪ coNP, we begin to have the makings of a hierarchy. If we add that coNP
⊆ PSPACE (see Problem 8.13), we have the relationships between complexity classes shown
schematically in Fig. 8.7.



8.7 Reductions 

In this section we specialize the reductions introduced in Section 2.4 and use them to classify 
problems into categories. We show that if problem A is reduced to problem B by a function 
in the set R and A is hard relative to R, then B cannot be easy relative to R because A can 
be solved easily by reducing it to B and solving B with an easy algorithm, contradicting the 
fact that A is hard. On the other hand, if B is easy to solve relative to R, then A must be 
easy to solve. Thus, reductions can be used to show that some problems are hard or easy. Also, 
if A can be reduced to B by a function in R and vice versa, then A and B have the same 
complexity relative to R. 

Reductions are widely used in computer science; we use them whenever we specialize one 
procedure to realize another. Thus, reductions in the form of simulations are used throughout 
Chapter 3 to exhibit circuits that compute the same functions that are computed by finite- 
state, random-access, and Turing machines, with and without nondeterminism. Simulations 
prove to be an important type of reduction. Similarly, in Chapter 10 we use simulation to show 
that any computation done in the pebble game can be simulated by a branching program. 

Not only did we simulate machines with memory by circuits in Chapter 3, but we demon- 
strated in Sections 3.9.5 and 3.9.6 that the languages CIRCUIT VALUE and CIRCUIT SAT 
describing circuits are P-complete and NP-complete, respectively. We demonstrated that each 
string x in an arbitrary language in P (NP) could be translated into a string in CIRCUIT VALUE 
(respectively, CIRCUIT SAT) by a program whose running time is polynomial in the length of 
x and whose space is logarithmic in its length. 

In this chapter we extend these results. We consider primarily transformations (also called 
many-one reductions and just reductions in Section 5.8.1), a type of reduction in which an 
instance of one decision problem is translated to an instance of a second problem such that the 
former is a "yes" instance if and only if the latter is a "yes" instance. A Turing reduction is a 
second type of reduction that is defined by an oracle Turing machine. (See Section 8.4.2 and 
Problem 8.15.) In this case the Turing machine may make more than one call to the second 
problem (the oracle). A transformation is equivalent to an oracle Turing reduction that makes 
one call to the oracle. Turing reductions subsume all previous reductions used elsewhere in this 
book. (See Problems 8.15 and 8.16.) However, since the results of this section can be derived 
with the weaker transformations, we limit our attention to them. 

DEFINITION 8.7.1 If $L_1$ and $L_2$ are languages, a transformation h from $L_1$ to $L_2$ is a
DTM-computable function $h : B^* \mapsto B^*$ such that $x \in L_1$ if and only if $h(x) \in L_2$. A
resource-bounded transformation is a transformation that is computed under a resource bound
such as deterministic logarithmic space or polynomial time.

The classification of problems is simplified by considering classes of transformations. These
classes will be determined by bounds on resources such as space and time on a Turing machine
or circuit size and depth.

DEFINITION 8.7.2 For decision problems $V_1$ and $V_2$, the notation $V_1 \le_R V_2$ means that $V_1$ can
be transformed to $V_2$ by a transformation in the class R.

Compatibility among transformation classes and complexity classes helps determine con- 
ditions under which problems are hard. 

DEFINITION 8.7.3 Let C be a complexity class, R a class of resource-bounded transformations, and
$V_1$ and $V_2$ decision problems. A set of transformations R is compatible with C if $V_1 \le_R V_2$
and $V_2 \in C$ imply that $V_1 \in C$.

It is easy to see that the polynomial-time transformations (denoted $\le_p$) are compatible
with P. (See Problem 8.17.) Also compatible with P are the log-space transformations
(denoted $\le_{\text{log-space}}$) associated with transformations that can be computed in logarithmic space.
Log-space transformations are also polynomial-time transformations, as shown in Theorem 8.5.8.



8.8 Hard and Complete Problems 



Classes of problems are defined above by their use of space and time. We now set the stage for 
the identification of problems that are hard relative to members of these classes. A few more 
definitions are needed before we begin this task. 

DEFINITION 8.8.1 A class R of transformations is transitive if the composition of any two
transformations in R is also in R and for all problems $V_1$, $V_2$, and $V_3$, $V_1 \le_R V_2$ and
$V_2 \le_R V_3$ implies that $V_1 \le_R V_3$.

If a class R of transformations is transitive, then we can compose any two transformations 
in the class and obtain another transformation in the class. Transitivity is used to define hard 
and complete problems. 

The transformations $\le_p$ and $\le_{\text{log-space}}$ described above are transitive. Below we show
that $\le_{\text{log-space}}$ is transitive and leave to the reader the proof of transitivity of $\le_p$ and the
polynomial-time Turing reductions. (See Problem 8.19.)

THEOREM 8.8.1 Log-space transformations are transitive.

Proof A log-space transformation is a DTM that has a read-only input tape, a write-only 
output tape, and a work tape or tapes on which it uses O(logn) cells to process an input 
string w of length n. As shown in Theorem 8.5.8, such DTMs halt within polynomial time. 
We now design a machine T that composes two log-space transformations in logarithmic 
space. (See Fig. 8.10.) 

Figure 8.10 The composition of two deterministic log-space Turing machines.

Let $M_1$ and $M_2$ denote the first and second log-space DTMs. When $M_1$ and $M_2$ are
composed to form T, the output tape of $M_1$, which is also the input tape of $M_2$, becomes
a work tape of T. Since $M_1$ may execute a polynomial number of steps, we cannot store all
its output before beginning the computation by $M_2$. Instead we must be more clever. We
keep the contents of the work tapes of both machines as well as (and this is where we are
clever) an integer $h_1$ recording the position of the input head of $M_2$ on the output tape of
$M_1$. If $M_2$ moves its input head right by one step, $M_1$ is simulated until one more output
is produced. If its head moves left, we decrement $h_1$, restart $M_1$, and simulate it until $h_1$
outputs are produced and then supply this output as an input to $M_2$.

The space used by this simulation is the space used by $M_1$ and $M_2$ plus the space for
$h_1$, the value under the input head of $M_2$, and some temporary space. The total space is
logarithmic in n since $h_1$ is at most a polynomial in n. ■
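
The recomputation trick in this proof can be illustrated with ordinary Python generators. In the sketch below, the second transducer never sees a stored copy of the first transducer's output; each time it needs the symbol at head position h, the first transducer is restarted and run until that symbol appears. The two toy transducers (one doubles each symbol, the other reverses its input) and all function names are hypothetical choices made for this example, and Python itself does not enforce any space bound.

```python
def m1(w):
    """Hypothetical first transducer: emits every input symbol twice."""
    for c in w:
        yield c
        yield c

def symbol_at(w, h):
    """Restart M1 on w and return its output symbol at position h
    (None if M1 produces fewer than h + 1 symbols)."""
    for i, c in enumerate(m1(w)):
        if i == h:
            return c
    return None

def m2(read):
    """Hypothetical second transducer: outputs its input reversed.
    It touches its input only through read(h) and stores only the head position h."""
    h = 0
    while read(h) is not None:  # move right to find the end of the input
        h += 1
    while h > 0:                # move left, emitting symbols
        h -= 1
        yield read(h)

def compose(w):
    """Run M2 on the output of M1 without ever storing that output."""
    return "".join(m2(lambda h: symbol_at(w, h)))

print(compose("ab"))
```

The price of saving space is time: every head movement of the second machine may trigger a full rerun of the first, which is still polynomial when both machines run in polynomial time.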

We now apply transitivity of reductions to define hard and complete problems. 

DEFINITION 8.8.2 Let R be a class of reductions, let C be a complexity class, and let R be
compatible with C. A problem Q is hard for C under R-reductions if for every problem V ∈ C,
$V \le_R Q$. A problem Q is complete for C under R-reductions if it is hard for C under
R-reductions and is a member of C.

Problems are hard for a class if they are at least as hard to solve as any other problem in the
class. Sometimes problems are shown hard for a class without showing that they are members
of that class. Complete problems are members of the class for which they are hard. Thus,
complete problems are the hardest problems in the class. We now define three important classes
of complete problems.

DEFINITION 8.8.3 Problems in P that are hard for P under log-space reductions are called
P-complete. Problems in NP that are hard for NP under polynomial-time reductions are called
NP-complete. Problems in PSPACE that are hard for PSPACE under polynomial-time reductions are
called PSPACE-complete.

We state Theorem 8.8.2, which follows directly from Definition 8.7.3 and transitivity of 
log-space and polynomial-time reductions, because it incorporates as conditions the goals of 
the study of P-complete, NP-complete, and PSPACE-complete problems, namely, to show 
that all problems in P can be solved in log-space and all problems in NP and PSPACE can be 
solved in polynomial time. It is unlikely that any of these goals can be reached. 

THEOREM 8.8.2 If a P-complete problem can be solved in log-space, then all problems in P can
be solved in log-space. If an NP-complete problem is in P, then P = NP. If a PSPACE-complete
problem is in P, then P = PSPACE.

In Theorem 8.14.2 we show that if a P-complete problem can be solved in poly-logarithmic
time with polynomially many processors on a CREW PRAM (that is, if it is fully parallelizable),
then so can all problems in P. It is considered unlikely that all languages in P can be fully
parallelized. Nonetheless, the question of the parallelizability of P is reduced to deciding whether
P-complete problems are parallelizable.



8.9 P-Complete Problems 



To show that a problem V is P-complete we must show that it is in P and that all problems
in P can be reduced to V via a log-space reduction. (See Section 3.9.5.) The task of showing
this is simplified by the knowledge that log-space reductions are transitive: if another problem
Q has already been shown to be P-complete, to show that V is P-complete it suffices to show
there is a log-space reduction from Q to V and that V ∈ P.

CIRCUIT VALUE 

Instance: A circuit description with fixed values for its input variables and a designated
output gate.

Answer: "Yes" if the output of the circuit has value 1.

In Section 3.9.5 we show that the CIRCUIT VALUE problem described above is P-complete
by demonstrating that for every decision problem V in P an instance w of V and a DTM M
that recognizes "Yes" instances of V can be translated by a log-space DTM into an instance c
of CIRCUIT VALUE such that w is a "Yes" instance of V if and only if c is a "Yes" instance of
CIRCUIT VALUE.

Since P is closed under complements (see Theorem 8.6.1), it follows that if the "Yes" in- 
stances of a decision problem can be determined in polynomial time, so can the "No" instances. 
Thus, the CIRCUIT VALUE problem is equivalent to determining the value of a circuit from its 
description. Note that for CIRCUIT VALUE the values of all variables of a circuit are included 
in its description. 

CIRCUIT VALUE is in P because, as shown in Theorem 8.13.2, a circuit can be evaluated
in a number of steps proportional at worst to the square of the length of its description. Thus,
an instance of CIRCUIT VALUE can be evaluated in a polynomial number of steps.

Monotone circuits are constructed of AND and OR gates. The functions computed by
monotone circuits form an asymptotically small subset of the set of Boolean functions. Also,
many important Boolean functions are not monotone, such as binary addition. But even
though monotone circuits are a very restricted class of circuits, the monotone version of
CIRCUIT VALUE, defined below, is also P-complete.

MONOTONE CIRCUIT VALUE 

Instance: A description for a monotone circuit with fixed values for its input variables and
a designated output gate.

Answer: "Yes" if the output of the circuit has value 1.

CIRCUIT VALUE is a starting point to show that many other problems are P-complete. We
begin by reducing it to MONOTONE CIRCUIT VALUE.

THEOREM 8.9.1 MONOTONE CIRCUIT VALUE is P-complete.

Proof As shown in Problem 2.12, every Boolean function can be realized with just AND 
and OR gates (this is known as dual-rail logic) if the values of input variables and their 
complements are made available. We reduce an instance of CIRCUIT VALUE to an instance 
of MONOTONE CIRCUIT VALUE by replacing each gate with the pair of monotone gates 
described in Problem 2.12. Such descriptions can be written out in log-space if the gates in 
the monotone circuit are numbered properly. (See Problem 8.20.) The reduction must also 
write out the values of variables of the original circuit and their complements. ■ 
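
The dual-rail replacement used in this proof can be sketched in Python. The construction of Problem 2.12 is not reproduced in this section, so the sketch assumes the usual dual-rail scheme: each gate i of the original circuit becomes a pair of monotone gates, a "true rail" at index 2i and a "complement rail" at index 2i + 1, with De Morgan's rules applied to the complement rail. The gate-list encoding and the function names are assumptions made for this illustration.

```python
def evaluate(circuit, out):
    """Evaluate a straight-line circuit and return the value of gate `out`."""
    vals = []
    for gate in circuit:
        op = gate[0]
        if op == "IN":
            vals.append(gate[1])
        elif op == "NOT":
            vals.append(1 - vals[gate[1]])
        elif op == "AND":
            vals.append(vals[gate[1]] & vals[gate[2]])
        elif op == "OR":
            vals.append(vals[gate[1]] | vals[gate[2]])
        else:
            raise ValueError(op)
    return vals[out]

def to_dual_rail(circuit):
    """Replace gate i by monotone gates 2i (its value) and 2i + 1 (its complement)."""
    mono = []
    for gate in circuit:
        op = gate[0]
        if op == "IN":
            # The reduction writes out each input value and its complement.
            mono += [("IN", gate[1]), ("IN", 1 - gate[1])]
        elif op == "NOT":
            j = gate[1]
            # Swap the rails; a two-input OR of the same wire just copies it.
            mono += [("OR", 2 * j + 1, 2 * j + 1), ("OR", 2 * j, 2 * j)]
        elif op == "AND":
            j, k = gate[1], gate[2]
            mono += [("AND", 2 * j, 2 * k), ("OR", 2 * j + 1, 2 * k + 1)]
        elif op == "OR":
            j, k = gate[1], gate[2]
            mono += [("OR", 2 * j, 2 * k), ("AND", 2 * j + 1, 2 * k + 1)]
        else:
            raise ValueError(op)
    return mono

def xor_circuit(a, b):
    """Exclusive-or of two fixed inputs, written with NOT, AND, and OR gates."""
    return [("IN", a), ("IN", b), ("NOT", 0), ("NOT", 1),
            ("AND", 0, 3), ("AND", 2, 1), ("OR", 4, 5)]

mono = to_dual_rail(xor_circuit(1, 0))
print(evaluate(mono, 2 * 6))
```

Because gate i maps to indices 2i and 2i + 1, the new gate numbers can be computed from the old ones with simple arithmetic, which is the kind of numbering that makes a log-space reduction plausible.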

The class of P-complete problems is very rich. Space limitations require us to limit our 
treatment of this subject to two more problems. We now show that LINEAR INEQUALITIES 
described below is P-complete. LINEAR INEQUALITIES is important because it is directly re- 
lated to LINEAR PROGRAMMING, which is widely used to characterize optimization problems. 
The reader is asked to show that LINEAR PROGRAMMING is P-complete. (See Problem 8.21.) 

LINEAR INEQUALITIES 

Instance: An integer-valued m × n matrix A and column m-vector b.

Answer: "Yes" if there is a rational column n-vector x > 0 (all components are non-negative
and at least one is non-zero) such that Ax ≤ b.

We show that LINEAR INEQUALITIES is P-hard, that is, that every problem in P can be 
reduced to it in log-space. The proof that LINEAR INEQUALITIES is in P, an important and 
difficult result in its own right, is not given here. (See [165].) 

THEOREM 8.9.2 LINEAR INEQUALITIES is P-hard. 

Proof We give a log-space reduction of CIRCUIT VALUE to LINEAR INEQUALITIES. That 
is, we show that in log-space an instance of CIRCUIT VALUE can be transformed to an in- 
stance of LINEAR INEQUALITIES so that an instance of CIRCUIT VALUE is a "Yes" instance 
if and only if the corresponding instance of LINEAR INEQUALITIES is a "Yes" instance. 

The log-space reduction that we use converts each gate and input in an instance of a
circuit into a set of inequalities. The inequalities describing each gate are shown below. (An
equality relation a = b is equivalent to two inequality relations, a ≤ b and b ≤ a.) The
reduction also writes the equality z = 1 for the output gate z. Since each variable must
be non-negative, this last condition ensures that the resulting vector of variables, x, satisfies
x > 0.

Given an instance of CIRCUIT VALUE, each assignment to a variable is translated into
an equality statement of the form $x_i = 0$ or $x_i = 1$. Similarly, each AND, OR, and NOT
gate is translated into a set of inequalities of the form shown above. Logarithmic temporary
space suffices to hold gate numbers and to write these inequalities because the number of
bits needed to represent each gate number is logarithmic in the length of an instance of
CIRCUIT VALUE.

To see that an instance of CIRCUIT VALUE is a "Yes" instance if and only if the instance
of LINEAR INEQUALITIES is also a "Yes" instance, observe that inputs of 0 or 1 to a gate
result in the correct output if and only if the corresponding set of inequalities forces the
output variable to have the same value. By induction on the size of the circuit instance, the
values computed by each gate are exactly the same as the values of the corresponding output
variables in the set of inequalities. ■
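
The table of gate inequalities does not appear in this text, so the following Python sketch assumes one standard encoding: for $x_i = x_j \wedge x_k$ the constraints $0 \le x_i \le 1$, $x_i \le x_j$, $x_i \le x_k$, $x_j + x_k - 1 \le x_i$; for $x_i = x_j \vee x_k$ the constraints $0 \le x_i \le 1$, $x_j \le x_i$, $x_k \le x_i$, $x_i \le x_j + x_k$; and for NOT the equality $x_i = 1 - x_j$. The sketch spot-checks, over a small grid of rational values, that 0/1 inputs force the output variable to the gate's Boolean value. The grid and the function names are assumptions for this illustration, and a grid check is not a proof.

```python
from fractions import Fraction
from itertools import product

# Candidate rational values for the output variable: 0, 1/4, 1/2, 3/4, 1.
GRID = [Fraction(k, 4) for k in range(5)]

def and_ok(xi, xj, xk):
    """Assumed inequalities for xi = xj AND xk."""
    return 0 <= xi <= 1 and xi <= xj and xi <= xk and xj + xk - 1 <= xi

def or_ok(xi, xj, xk):
    """Assumed inequalities for xi = xj OR xk."""
    return 0 <= xi <= 1 and xj <= xi and xk <= xi and xi <= xj + xk

def not_ok(xi, xj):
    """Assumed equality for xi = NOT xj."""
    return xi == 1 - xj

def forced_outputs(constraint, inputs):
    """All grid values of the output variable consistent with the given inputs."""
    return [xi for xi in GRID if constraint(xi, *inputs)]

for a, b in product((0, 1), repeat=2):
    print(a, b, forced_outputs(and_ok, (a, b)), forced_outputs(or_ok, (a, b)))
```

In each case exactly one value survives, which is the property the induction in the proof relies on.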

We give as our last example of a P-complete problem DTM ACCEPTANCE, the problem 
of deciding if a string is accepted by a deterministic Turing machine in a number of steps 
specified as a unary number. (The integer k is represented as a unary number by a string of k 
characters.) For this problem it is more convenient to give a direct reduction from all problems 
in P to DTM ACCEPTANCE. 

DTM ACCEPTANCE 

Instance: A description of a DTM M, a string w, and an integer n written in unary. 
Answer: "Yes" if and only if M, when started with input w, halts with the answer "Yes" in 
at most n steps. 

THEOREM 8.9.3 DTM ACCEPTANCE is P-complete.

Proof To show that DTM ACCEPTANCE is log-space complete for P, consider an arbitrary
problem V in P and an arbitrary instance of V, namely x. There is some Turing machine,
say $M_V$, that accepts instances x of V of length n in time p(n), p a polynomial. We assume
that p is included with the specification of $M_V$. For example, if $p(y) = 2y^4 + 3y^2 + 1$, we
can represent it with the string ((2,4), (3,2), (1,0)). The log-space Turing machine that
translates $M_V$ and x into an instance of DTM ACCEPTANCE writes the description of $M_V$
together with the input x and the value of p(n) in unary. Constant temporary space suffices
to move the descriptions of $M_V$ and x to the output tape. To complete the proof we need
only show that O(log n) temporary space suffices to write the value of p(n) in unary, where
n is the length of x.

Since the length of the input x is provided in unary, that is, by the number of characters
it contains, its length n can be written in binary on a work tape in space O(log n) by
counting the number of characters in x. Since it is not difficult to show that any fixed power of
a k-bit binary number can be computed by a DTM in work space O(k), it follows that any
fixed polynomial in n can be computed by a DTM in work space O(k) = O(log n). (See
Problem 8.18.)

To show that DTM ACCEPTANCE is in P, we design a Turing machine that accepts the
"Yes" instances in polynomial time. This machine copies the unary string of length n to one
of its work tapes. Given the description of the DTM M, it simulates M with a universal
Turing machine on input w. When it completes a step, it advances the head on the work
tape containing n in unary, declaring the instance of DTM ACCEPTANCE accepted if M
halts with the answer "Yes" without using more than n steps. By definition, it will complete
its simulation of M in at most n of M's steps, each of which uses a number of steps on the
simulating machine that is polynomial in the length of the description of M. That is, it accepts
a "Yes" instance of DTM ACCEPTANCE in time polynomial in the length of the input. ■
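
The output of the reduction in this proof is easy to picture with a short Python sketch. It uses the coefficient-exponent pairs from the example polynomial in the proof and writes p(n) as a string of n ones. The function names and the use of a plain string for the machine description are assumptions made for illustration; the sketch shows what is written, not how a log-space machine would write it.

```python
def eval_poly(terms, n):
    """Evaluate a polynomial given as (coefficient, exponent) pairs at n."""
    return sum(coeff * n ** exp for coeff, exp in terms)

def to_dtm_acceptance_instance(machine_description, x, terms):
    """Build the triple (description of M, input w, step bound in unary)."""
    n = len(x)  # the input length is available by counting characters
    return machine_description, x, "1" * eval_poly(terms, n)

# The example from the proof: p(y) = 2y^4 + 3y^2 + 1 as ((2,4), (3,2), (1,0)).
terms = ((2, 4), (3, 2), (1, 0))
description, w, unary_bound = to_dtm_acceptance_instance("<description of M>", "01", terms)
print(len(unary_bound))
```

Writing the bound in unary is what keeps the instance length polynomial in p(n), so a step-by-step simulation for that many steps runs in time polynomial in the length of the instance.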



8.10 NP-Complete Problems 



As mentioned above, the NP-complete problems are the problems in NP that are the most
difficult to solve. We have shown that NP ⊆ PSPACE ⊆ EXPTIME, that is, that every problem in
NP, including the NP-complete problems, can be solved in exponential time. Since the
NP-complete problems are the hardest problems in NP, each of these is at worst an
exponential-time problem. Thus, we know that the time required by the NP-complete problems
lies between polynomial and exponential, but we don't know where.

The CIRCUIT SAT problem is to determine from a description of a circuit whether it can 
be satisfied; that is, whether values can be assigned to its inputs such that the circuit output 
has value 1 . As mentioned above, this is our canonical NP-complete problem. 

CIRCUIT SAT 

Instance: A circuit description with n input variables $\{x_1, x_2, \ldots, x_n\}$ for some integer n
and a designated output gate.

Answer: "Yes" if there is an assignment of values to the variables such that the output of the
circuit has value 1.

As shown in Section 3.9.6, CIRCUIT SAT is an NP-complete problem. The goal of this 
problem is to recognize the "Yes" instances of CIRCUIT SAT, instances for which there are 
values for the input variables such that the circuit has value 1 . 
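
The gap between checking and finding a satisfying input can be seen in a small Python sketch. Evaluating a circuit on one proposed input takes a single pass over its gates, while the search below tries all $2^n$ inputs. The gate-list encoding, in which the last gate is taken as the designated output, and the function names are assumptions made for this illustration.

```python
from itertools import product

def circuit_output(circuit, inputs):
    """Evaluate the circuit on one input tuple; the last gate is the output gate."""
    vals = []
    for gate in circuit:
        op = gate[0]
        if op == "VAR":
            vals.append(inputs[gate[1]])
        elif op == "NOT":
            vals.append(1 - vals[gate[1]])
        elif op == "AND":
            vals.append(vals[gate[1]] & vals[gate[2]])
        elif op == "OR":
            vals.append(vals[gate[1]] | vals[gate[2]])
        else:
            raise ValueError(op)
    return vals[-1]

def find_satisfying_inputs(circuit, n):
    """Exhaustive search over all 2**n input tuples; returns a certificate or None."""
    for inputs in product((0, 1), repeat=n):
        if circuit_output(circuit, inputs) == 1:
            return inputs
    return None

# x0 AND (NOT x1) is satisfiable; x0 AND (NOT x0) is not.
satisfiable = [("VAR", 0), ("VAR", 1), ("NOT", 1), ("AND", 0, 2)]
unsatisfiable = [("VAR", 0), ("NOT", 0), ("AND", 0, 1)]
print(find_satisfying_inputs(satisfiable, 2), find_satisfying_inputs(unsatisfiable, 1))
```

A returned tuple plays the role of a certificate: anyone can confirm it with one call to circuit_output.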

In Section 3.9.6 we showed that CIRCUIT SAT described above is NP-complete by demonstrating 
that for every decision problem $\mathcal{P}$ in NP an instance w of $\mathcal{P}$ and an NDTM M that 
accepts "Yes" instances of $\mathcal{P}$ can be translated by a polynomial-time (actually, a log-space) 
DTM into an instance c of CIRCUIT SAT such that w is a "Yes" instance of $\mathcal{P}$ if and only if c 
is a "Yes" instance of CIRCUIT SAT. 

Although it suffices to reduce problems in NP via a polynomial-time transformation to an 
NP-complete problem, each of the reductions given in this chapter can be done by a log-space 
transformation. We now show that a variety of other problems are NP-complete. 

8.10.1 NP-Complete Satisfiability Problems 

In Section 3.9.6 we showed that SATISFIABILITY defined below is NP-complete. In this sec- 
tion we demonstrate that two variants of this language are NP-complete by simple extensions 
of the basic proof that CIRCUIT SAT is NP-complete. 

SATISFIABILITY 

Instance: A set of literals $X = \{x_1, \bar{x}_1, x_2, \bar{x}_2, \ldots, x_n, \bar{x}_n\}$ and a sequence of clauses 
$C = (c_1, c_2, \ldots, c_m)$, where each clause $c_i$ is a subset of X. 

Answer: "Yes" if there is a (satisfying) assignment of values for the variables $\{x_1, x_2, \ldots, x_n\}$ 
over the set $\mathcal{B}$ such that each clause has at least one literal whose value is 1. 

The two variants of SATISFIABILITY are 3-SAT, which has at most three literals in each 
clause, and NAESAT, in which not all literals in each clause have the same value. 

3-SAT 

Instance: A set of literals $X = \{x_1, \bar{x}_1, x_2, \bar{x}_2, \ldots, x_n, \bar{x}_n\}$, and a sequence of clauses 
$C = (c_1, c_2, \ldots, c_m)$, where each clause $c_i$ is a subset of X containing at most three 
literals. 

Answer: "Yes" if there is an assignment of values for variables $\{x_1, x_2, \ldots, x_n\}$ over the set 
$\mathcal{B}$ such that each clause has at least one literal whose value is 1. 

THEOREM 8.10.1 3-SAT is NP-complete. 

Proof The proof that SATISFIABILITY is NP-complete also applies to 3-SAT because each 
of the clauses produced in the transformation of instances of CIRCUIT SAT has at most three 
literals per clause. ■ 

NAESAT 

Instance: An instance of 3-SAT. 

Answer: "Yes" if each clause is satisfiable when not all literals have the same value. 

NAESAT contains as its "Yes" instances those instances of 3-SAT in which the literals in 
each clause are not all equal. 

THEOREM 8.10.2 NAESAT is NP-complete. 

Proof We reduce CIRCUIT SAT to NAESAT using almost the same reduction as for 3-SAT. 
Each gate is replaced by a set of clauses. (See Fig. 8.11.) The only difference is that we 
add the new literal y to each two-literal clause associated with AND and OR gates and to 
the clause associated with the output gate. Clearly, this reduction can be performed in de- 
terministic log-space. Since a "Yes" instance of NAESAT can be verified in nondeterministic 
polynomial time, NAESAT is in NP. We now show that it is NP-hard. 

Given a "Yes" instance of CIRCUIT SAT, we show that the corresponding instance of NAESAT 
is a "Yes" instance. Since every clause is satisfied in a "Yes" instance of CIRCUIT SAT, every 
clause of the corresponding instance of NAESAT has at least one literal with value 1. The 
clauses that don't contain the literal y by their nature do not have all literals equal. Those 
containing y can be made to satisfy this condition by setting y to 0, thereby providing a 
"Yes" instance of NAESAT. 

Now consider a "Yes" instance of NAESAT produced by the mapping from CIRCUIT 
SAT. Replacing every literal by its complement generates another "Yes" instance of NAESAT 

Figure 8.11 A reduction from CIRCUIT SAT to NAESAT is obtained by replacing each gate 
in a "Yes" instance of CIRCUIT SAT by a set of clauses. The clauses used in the reduction from 
CIRCUIT SAT to 3-SAT (see Section 3.9.6) are those shown above with the literal y removed. In 
the reduction to NAESAT the literal y is added to the 2-literal clauses used for AND and OR gates 
and to the output clause. 



since the literals in each clause are not all equal, a property that applies before and after 
complementation. In one of these "Yes" instances y is assigned the value 0. Because this is a 
"Yes" instance of NAESAT, at least one literal in each clause has value 1; that is, each clause 
is satisfied. This implies that the original instance of CIRCUIT SAT is satisfiable. It follows 
that an instance of CIRCUIT SAT has been translated into an instance of NAESAT so that the 
former is a "Yes" instance if and only if the latter is a "Yes" instance. ■ 
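The complementation property used in this proof can be checked on a small example. The sketch below assumes a signed-integer encoding of literals (`+i` for the variable x_i and `-i` for its complement); that encoding and the function names are illustrative choices, not part of the text.

```python
def literal_value(lit, assignment):
    """Value of a literal: +i stands for x_i, -i for its complement."""
    value = assignment[abs(lit)]
    return value if lit > 0 else 1 - value

def is_satisfying(clauses, assignment):
    """Every clause has at least one literal with value 1."""
    return all(any(literal_value(l, assignment) for l in c) for c in clauses)

def is_nae_satisfying(clauses, assignment):
    """Every clause has a literal with value 1 and a literal with value 0."""
    return all(len({literal_value(l, assignment) for l in c}) == 2
               for c in clauses)

def complement(assignment):
    """Flip the value of every variable."""
    return {var: 1 - val for var, val in assignment.items()}

clauses = [(1, -2, 3), (-1, 2, 3)]
a = {1: 1, 2: 1, 3: 0}
```

Here `a` and `complement(a)` both pass `is_nae_satisfying`, and each is therefore also an ordinary satisfying assignment. The converse fails: the all-ones assignment satisfies the single clause `(1, 2, 3)` but does not satisfy it in the not-all-equal sense.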

8.10.2 Other NP-Complete Problems 

This section gives a sampling of additional NP-complete problems. Following the format of 
the previous section, we present each problem and then give a proof that it is NP-complete. 
Each proof includes a reduction of a problem previously shown NP-complete to the current 
problem. The succession of reductions developed in this book is shown in Fig. 8.12. 

INDEPENDENT SET 

Instance: A graph G = (V, E) and an integer k. 

Answer: "Yes" if there is a set of k vertices of G such that there is no edge in E between 

them. 

THEOREM 8.10.3 INDEPENDENT SET is NP-complete. 

Proof INDEPENDENT SET is in NP because an NDTM can propose and then verify in 
polynomial time a set of k independent vertices. We show that INDEPENDENT SET is NP-hard 
by reducing 3-SAT to it. We begin by showing that a restricted version of 3-SAT, one 
in which each clause contains exactly three literals, is also NP-complete. First, if for some 
variable x both $x$ and $\bar{x}$ are in the same clause, we eliminate the clause since it is always 
satisfied. Second, we replace each 2-literal clause $(a \lor b)$ with the two 3-literal clauses 
$(a \lor b \lor z)$ and $(a \lor b \lor \bar{z})$, where z is a new variable. Since z is either 0 or 1, if all 
clauses are satisfied then $(a \lor b)$ has value 1 in both cases. Similarly, a clause with a single 
literal can be transformed to one containing three literals by introducing two new variables 
and replacing the clause containing the single literal with four clauses each containing three 
literals. Since adding 

Figure 8.12 The succession of reductions used in this chapter. 



distinct new variables to each clause that contains fewer than three literals can be done in 
log-space, this new problem, which we also call 3-SAT, is also NP-complete. 

We now construct an instance of INDEPENDENT SET from this new version of 3-SAT 
in which k is equal to the number of clauses. (See Fig. 8.13.) Its graph G has one triangle 
for each clause and vertices carry the names of the three literals in a clause. G also has an 
edge between vertices carrying the labels of complementary literals. 

Consider a "Yes" instance of 3-SAT. Pick one literal with value 1 from each clause. 
This identifies k vertices, one per triangle, and no edge exists between these vertices. Thus, 
the instance of INDEPENDENT SET is a "Yes" instance. Conversely, a "Yes" instance of 
INDEPENDENT SET on G has k vertices, one per triangle, and no two vertices carry the 
label of a variable and its complement because all such vertices have an edge between them. 
The literals associated with these independent vertices are assigned value 1, causing each 
clause to be satisfied. Variables not so identified are assigned arbitrary values. ■ 
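The construction in this proof can be sketched in a few lines. Literals are encoded as signed integers (`+i` for x_i, `-i` for its complement) and vertices as triples (clause index, position, literal); both encodings and the function names are assumptions for illustration. The independent-set test is an exhaustive search intended only for tiny instances. The second example pads one-literal clauses by repeating the literal, a shortcut used here only to keep the example small; it is not the padding technique described in the proof.

```python
from itertools import combinations

def sat_to_independent_set(clauses):
    """Build (vertices, edges, k) from clauses with three literals each."""
    vertices = [(j, pos, lit)
                for j, clause in enumerate(clauses)
                for pos, lit in enumerate(clause)]
    edges = set()
    for u, v in combinations(vertices, 2):
        same_triangle = u[0] == v[0]     # edges of the clause triangle
        complementary = u[2] == -v[2]    # a literal and its complement
        if same_triangle or complementary:
            edges.add((u, v))
    return vertices, edges, len(clauses)

def has_independent_set(vertices, edges, k):
    """Exhaustive search over all k-subsets; exponential time."""
    for subset in combinations(vertices, k):
        if not any(pair in edges for pair in combinations(subset, 2)):
            return True
    return False

satisfiable = [(1, 2, 3), (-1, -2, 3), (-1, 2, -3)]
unsatisfiable = [(1, 1, 1), (-1, -1, -1)]   # x_1 and its complement
```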







Figure 8.13 A graph for an instance of INDEPENDENT SET constructed from the following 

instance of 3-SAT: (pCj V x 2 V X3) A (W\ Vi 2 V x 3 ) A {x\ V x 2 V £3). 

3-COLORING 

Instance: The description of a graph G = (V,E). 

Answer: "Yes" if there is an assignment of three colors to vertices such that adjacent vertices 

have different colors. 

THEOREM 8.10.4 3-COLORING is NP-complete. 

Proof To show that 3-COLORING is in NP, observe that a three-coloring of a graph can 
be proposed in nondeterministic polynomial time and verified in deterministic polynomial 
time. 

We reduce NAESAT to 3-COLORING. Recall that an instance of NAESAT is an instance 
of 3-SAT. A "Yes" instance of NAESAT is one for which each clause is satisfiable with not 
all literals equal. Let an instance of NAESAT consist of m clauses $C = (c_1, c_2, \ldots, c_m)$ 
containing exactly three literals from the set $X = \{x_1, \bar{x}_1, x_2, \bar{x}_2, \ldots, x_n, \bar{x}_n\}$ of literals in 
n variables. (Use the technique introduced in the proof of Theorem 8.10.3 to ensure that 
each clause in an instance of 3-SAT has exactly three literals per clause.) 

Given an instance of NAESAT, we construct a graph G in log-space and show that this 
graph is three-colorable if and only if the instance of NAESAT is a "Yes" instance. 

The graph G has a set of n variable triangles, one per variable. The vertices of the 
triangle associated with variable $x_i$ are $\{v, x_i, \bar{x}_i\}$. (See Fig. 8.14.) Thus, all the variable 
triangles have one vertex in common. For each clause containing three literals we construct 
one clause triangle per clause. If clause $c_j$ contains literals $\lambda_{j_1}$, $\lambda_{j_2}$, and $\lambda_{j_3}$, its associated 
clause triangle has vertices labeled $(j, \lambda_{j_1})$, $(j, \lambda_{j_2})$, and $(j, \lambda_{j_3})$. Finally, we add an edge 
between the vertex $(j, \lambda_{j_k})$ and the vertex associated with the literal $\lambda_{j_k}$. 

We now show that an instance of NAESAT is a "Yes" instance if and only if the graph G 
is three-colorable. Suppose the graph is three-colorable and the colors are {0, 1,2}. Since 



[Figure 8.14 panels: "Variable Triangle" and "Clause Triangle"] 



Figure 8.14 A graph G corresponding to the clauses Ci = {xi,X2,x^} and c-i — {xi,X2,Xi} 
in an instance of NAESAT. It has one variable triangle for each variable and one clause triangle for 
each clause. 




three colors are needed to color the vertices of a triangle and the variable triangles have a 
vertex labeled v in common, assume without loss of generality that this common vertex has 
color 2. The other two vertices in each variable triangle are assigned value 0 or 1, values we 
give to the associated variable and its complement. 

Consider now the coloring of clause triangles. Since three colors are needed to color 
vertices of a clause triangle, consider vertices with colors 0 and 1. The edges between these 
clause vertices and the corresponding vertices in variable triangles have different colors at 
each end. Let the literals in the clause triangles be given values that are the Boolean 
complement of their colors. This provides values for literals that are consistent with the values of 
variables and ensures that not all literals in a clause have the same value. The third vertex in 
each triangle has color 2. Give its literal a value consistent with the value of its variable. It 
follows that the clauses are a "Yes" instance of NAESAT. 

Suppose, on the other hand, that a set of clauses is a "Yes" instance of NAESAT. We 
show that the graph G is three-colorable. Assign color 2 to vertex v and colors 0 and 1 to 
vertices labeled $x_i$ and $\bar{x}_i$ based on the values of these literals in the "Yes" instance. Consider 
two literals in clause $c_j$ that have different values. If $x_i$ ($\bar{x}_i$) is one of these, give the vertex 
labeled $(j, x_i)$ ($(j, \bar{x}_i)$) the value that is the Boolean complement of the color of $x_i$ ($\bar{x}_i$) in 
its variable triangle. Do the same for the other literal. Since the third literal has the same 
value as one of the other two literals (they have different values), let its vertex have color 2. 
Then G is three-colorable. Thus, G is a "Yes" instance of 3-COLORING if and only if the 
corresponding set of clauses is a "Yes" instance of NAESAT. ■ 
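The graph built in this proof can be generated mechanically. In the sketch below the shared vertex is the string `"v"`, literal vertices are signed integers (`+i` for x_i, `-i` for its complement), and the corners of clause triangle j are pairs `(j, position)`; these encodings and the function names are assumptions for illustration. Colorability is tested by a simple backtracking search, which is exponential in the worst case.

```python
def naesat_to_graph(clauses, n):
    """Return (vertices, edges) for clauses with exactly three literals each."""
    edges = set()
    for i in range(1, n + 1):
        # Variable triangle on v, x_i and its complement; v is shared.
        edges |= {("v", i), ("v", -i), (i, -i)}
    for j, clause in enumerate(clauses):
        corners = [(j, pos) for pos in range(3)]
        # Clause triangle.
        edges |= {(corners[0], corners[1]),
                  (corners[1], corners[2]),
                  (corners[0], corners[2])}
        # Join each corner to the vertex of its literal.
        for corner, lit in zip(corners, clause):
            edges.add((corner, lit))
    vertices = sorted({u for edge in edges for u in edge}, key=repr)
    return vertices, edges

def is_three_colorable(vertices, edges):
    """Backtracking search for a proper coloring with colors 0, 1, 2."""
    neighbors = {u: set() for u in vertices}
    for u, w in edges:
        neighbors[u].add(w)
        neighbors[w].add(u)
    coloring = {}

    def extend(k):
        if k == len(vertices):
            return True
        u = vertices[k]
        for color in range(3):
            if all(coloring.get(w) != color for w in neighbors[u]):
                coloring[u] = color
                if extend(k + 1):
                    return True
                del coloring[u]
        return False

    return extend(0)

nae_yes = [(1, 2, 3)]
# These four clauses rule out all eight assignments in the NAE sense.
nae_no = [(1, 2, 3), (1, 2, -3), (1, -2, 3), (1, -2, -3)]
```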

EXACT COVER 

Instance: A set $S = \{u_1, u_2, \ldots, u_p\}$ and a family $\{S_1, S_2, \ldots, S_n\}$ of subsets of S. 

Answer: "Yes" if there are disjoint subsets $S_{j_1}, S_{j_2}, \ldots, S_{j_t}$ such that $\bigcup_{1 \le i \le t} S_{j_i} = S$. 

THEOREM 8.10.5 EXACT COVER is NP-complete. 

Proof It is straightforward to show that EXACT COVER is in NP. An NDTM can simply 
select the subsets and then verify in time polynomial in the length of the input that these 
subsets are disjoint and that they cover the set S. 

We now give a log-space reduction from 3-COLORING to EXACT COVER. Given an 
instance of 3-COLORING, that is, a graph G = (V, E), we construct an instance of EXACT 
COVER, namely, a set S and a family of subsets of S such that G is a "Yes" instance of 
3-COLORING if and only if the family of sets is a "Yes" instance of EXACT COVER. 

As the set S we choose $S = V \cup \{\langle e, i \rangle \mid e \in E,\ 0 \le i \le 2\}$ and as the family 
of subsets of S we choose the sets $S_{v,i}$ and $R_{e,i}$ defined below for $v \in V$, $e \in E$ and 
$0 \le i \le 2$: 

$$S_{v,i} = \{v\} \cup \{\langle e, i \rangle \mid e \text{ is incident on } v \in V\}$$
$$R_{e,i} = \{\langle e, i \rangle\}$$

Let G be three-colorable. Then let $c_v$, an integer in {0, 1, 2}, be the color of vertex v. 
We show that the subsets $S_{v,c_v}$ for $v \in V$ and $R_{e,i}$ for $\langle e, i \rangle \notin S_{v,c_v}$ for any $v \in V$ 
are an exact cover. If $e = (v, w) \in E$, then $c_v \neq c_w$ and $S_{v,c_v}$ and $S_{w,c_w}$ are disjoint. By 
definition the sets $R_{e,i}$ are disjoint from the other sets. Furthermore, every element of S is 
in one of these sets. 

On the other hand, suppose that S has an exact cover. Then, for each $v \in V$, there is a 
unique $c_v$, $0 \le c_v \le 2$, such that the set $S_{v,c_v}$ is in the cover. To show that G has a 
three-coloring, assume that it doesn't and establish a contradiction. Since G doesn't have a 
three-coloring, there is an edge $e = (v, w)$ such that $c_v = c_w$. The element $\langle e, c_v \rangle$ then lies in 
both $S_{v,c_v}$ and $S_{w,c_w}$, which contradicts the assumption that S has an 
exact cover. It follows that G has a three-coloring if and only if S has an exact cover. ■ 
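A short sketch of this reduction follows. Vertices are strings, edges are pairs of vertices, and the element $\langle e, i \rangle$ is the Python tuple `(e, i)`; these encodings, the function names, and the small backtracking exact-cover search are assumptions made for illustration. The search branches on the element covered by the fewest usable subsets, a common heuristic that is not part of the proof.

```python
def coloring_to_exact_cover(vertices, edges):
    """Build the set S and the family of subsets S_{v,i} and R_{e,i}."""
    universe = set(vertices) | {(e, i) for e in edges for i in range(3)}
    family = []
    for v in vertices:
        for i in range(3):
            # S_{v,i}: v together with <e, i> for every edge e incident on v.
            family.append(frozenset({v} | {(e, i) for e in edges if v in e}))
    for e in edges:
        for i in range(3):
            family.append(frozenset({(e, i)}))    # R_{e,i}
    return universe, family

def has_exact_cover(universe, family):
    """Backtracking search for disjoint subsets whose union is `universe`."""
    if not universe:
        return True
    usable = [s for s in family if s <= universe]
    # Branch on the element that the fewest usable subsets can cover.
    target = min(universe, key=lambda x: sum(x in s for s in usable))
    return any(has_exact_cover(universe - s, usable)
               for s in usable if target in s)

triangle = (["a", "b", "c"], [("a", "b"), ("b", "c"), ("a", "c")])
k4 = (["a", "b", "c", "d"],
      [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("c", "d")])
```

The triangle is three-colorable and yields a "Yes" instance of EXACT COVER, while the complete graph on four vertices is not three-colorable and yields a "No" instance.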

SUBSET SUM 

Instance: A set $Q = \{a_1, a_2, \ldots, a_n\}$ of positive integers and a positive integer d. 

Answer: "Yes" if there is a subset of Q that adds to d. 

THEOREM 8.10.6 SUBSET SUM is NP-complete. 

Proof SUBSET SUM is in NP because a subset can be nondeterministically chosen in time 
equal to n and an accepting choice verified in a polynomial number of steps by adding up 
the chosen elements of the subset and comparing the result to d. 

To show that SUBSET SUM is NP-hard, we give a log-space reduction of EXACT COVER 
to it. Given an instance of EXACT COVER, namely, a set $S = \{u_1, u_2, \ldots, u_p\}$ and a family 
$\{S_1, S_2, \ldots, S_n\}$ of subsets of S, we construct the instance of SUBSET SUM characterized 
as follows. We let $\beta = n + 1$ and $d = \beta^{p-1} + \beta^{p-2} + \cdots + \beta^0 = (\beta^p - 1)/(\beta - 1)$. We 
represent the element $u_i \in S$ by the integer $\beta^{i-1}$, $1 \le i \le p$, and represent the set $S_j$ by 
the integer $a_j$ that is the sum of the integers associated with the elements contained in $S_j$. 
For example, if $p = n = 3$, $S_1 = \{u_1, u_3\}$, $S_2 = \{u_1, u_2\}$, and $S_3 = \{u_2\}$, we represent 
$S_1$ by $a_1 = \beta^2 + \beta^0$, $S_2$ by $a_2 = \beta + \beta^0$, and $S_3$ by $a_3 = \beta$. Since $S_1$ and $S_3$ form an 
exact cover of S, $a_1 + a_3 = \beta^2 + \beta + 1 = d$. 

Thus, given an instance of EXACT COVER, this polynomial-time transformation produces 
an instance of SUBSET SUM. We now show that the instance of the former is a "Yes" 
instance if and only if the instance of the latter is a "Yes" instance. To see this, observe that 
in adding the integers corresponding to the sets in an instance of EXACT COVER in base $\beta$ 
there is no carry from one power of $\beta$ to the next, because at most $n = \beta - 1$ integers are 
added and each contributes at most 1 to any power. Thus the coefficient of $\beta^k$ is exactly the 
number of times that $u_{k+1}$ appears in the chosen subsets of S. The 
subsets form a "Yes" instance of EXACT COVER exactly when the corresponding integers 
contain each power of $\beta$ exactly once, that is, when the integers sum to d. ■ 
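The worked example in this proof can be reproduced numerically. In the sketch below the elements $u_1, \ldots, u_p$ are represented by the integers 1 through p, and the function names are illustrative assumptions. With n = 3 subsets the base is $\beta = 4$, so the example gives $a_1 = 17$, $a_2 = 5$, $a_3 = 4$, and $d = 21$.

```python
from itertools import combinations

def exact_cover_to_subset_sum(p, subsets):
    """Map an EXACT COVER instance over elements 1..p to (a, d)."""
    beta = len(subsets) + 1
    # The element u_i contributes beta**(i - 1) to each subset containing it.
    a = [sum(beta ** (i - 1) for i in s) for s in subsets]
    d = sum(beta ** k for k in range(p))
    return a, d

def subset_sum(a, d):
    """Return indices of a sub-collection summing to d, or None (brute force)."""
    for r in range(1, len(a) + 1):
        for chosen in combinations(range(len(a)), r):
            if sum(a[j] for j in chosen) == d:
                return chosen
    return None

# The example from the text: S1 = {u1, u3}, S2 = {u1, u2}, S3 = {u2}.
a, d = exact_cover_to_subset_sum(3, [{1, 3}, {1, 2}, {2}])
solution = subset_sum(a, d)   # indices 0 and 2, that is, S1 and S3
```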

TASK SEQUENCING 

Instance: Positive integers $t_1, t_2, \ldots, t_r$, which are execution times, $d_1, d_2, \ldots, d_r$, which 
are deadlines, $p_1, p_2, \ldots, p_r$, which are penalties, and integer $k \ge 1$. 

Answer: "Yes" if there is a permutation $\pi$ of $\{1, 2, \ldots, r\}$ such that 

$$\left( \sum_{j=1}^{r} \left[ \text{if } t_{\pi(1)} + t_{\pi(2)} + \cdots + t_{\pi(j)} > d_{\pi(j)} \text{ then } p_{\pi(j)} \text{ else } 0 \right] \right) \le k$$



THEOREM 8.10.7 TASK SEQUENCING is NP-complete. 

Proof TASK SEQUENCING is in NP because a permutation 7r for a "Yes" instance can be 
verified as a satisfying permutation in polynomial time. We now give a log-space reduction 
of SUBSET SUM to TASK SEQUENCING. 

An instance of SUBSET SUM is a positive integer d and a set $Q = \{a_1, a_2, \ldots, a_n\}$ of 
positive integers. A "Yes" instance is one such that a subset of Q adds to d. We translate 
an instance of SUBSET SUM to an instance of TASK SEQUENCING by setting $r = n$, 
$t_i = p_i = a_i$, $d_i = d$, and $k = (\sum_i a_i) - d$. Consider a "Yes" instance of this TASK 
SEQUENCING problem. Then the following inequality holds: 

$$\left( \sum_{j=1}^{n} \left[ \text{if } a_{\pi(1)} + a_{\pi(2)} + \cdots + a_{\pi(j)} > d \text{ then } a_{\pi(j)} \text{ else } 0 \right] \right) \le k$$

Let q be the expression in parentheses in the above inequality. Then $q = a_{\pi(l+1)} + a_{\pi(l+2)} 
+ \cdots + a_{\pi(n)}$, where l is the integer for which $p = a_{\pi(1)} + a_{\pi(2)} + \cdots + a_{\pi(l)} \le d$ and 
$p + a_{\pi(l+1)} > d$. By definition $p + q = \sum_i a_i$. It follows that $q \ge \sum_i a_i - d$. Since 
$q \le k = \sum_i a_i - d$, we conclude that $p = d$ or that the instance of TASK SEQUENCING 
corresponds to a "Yes" instance of SUBSET SUM. Similarly, consider a "Yes" instance of 
SUBSET SUM. It follows from the above argument that there is a permutation such that the 
instance of TASK SEQUENCING is a "Yes" instance. ■ 
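The translation from SUBSET SUM can be tried on small inputs. The sketch below uses zero-based task indices and checks all permutations, which is feasible only for a handful of tasks; the function names are illustrative assumptions, and the reduction assumes $\sum_i a_i - d \ge 1$ so that k is a valid bound.

```python
from itertools import permutations

def total_penalty(order, t, d, p):
    """Sum the penalties of tasks that finish after their deadlines."""
    elapsed, penalty = 0, 0
    for j in order:
        elapsed += t[j]
        if elapsed > d[j]:
            penalty += p[j]
    return penalty

def subset_sum_to_task_sequencing(a, target):
    """Set t_i = p_i = a_i, d_i = target and k = sum(a) - target."""
    r = len(a)
    return list(a), [target] * r, list(a), sum(a) - target

def task_sequencing(t, d, p, k):
    """True if some ordering of the tasks has total penalty at most k."""
    return any(total_penalty(order, t, d, p) <= k
               for order in permutations(range(len(t))))

yes_instance = subset_sum_to_task_sequencing([3, 5, 7], 8)   # 3 + 5 = 8
no_instance = subset_sum_to_task_sequencing([3, 5, 7], 4)    # no subset sums to 4
```

For the first instance the order (3, 5, 7) finishes the first two tasks by time 8 and pays only the penalty 7, which equals k. For the second, every ordering pays at least 12, which exceeds k = 11.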

The following NP-complete problem is closely related to the P-complete problem LINEAR 
INEQUALITIES. The difference is that the vector x must be a 0-1 vector in the case of 0-1 
INTEGER PROGRAMMING, whereas in LINEAR INEQUALITIES it can be a vector of rationals. 
Thus, changing merely the conditions on the vector x elevates the problem from P to NP and 
makes it NP-complete. 

0-1 INTEGER PROGRAMMING 

Instance: An $n \times m$ matrix A and a column n-vector b, both over the ring of integers for 
integers n and m. 

Answer: "Yes" if there is a column m-vector x over the set {0, 1} such that Ax = b. 

THEOREM 8.10.8 0-1 INTEGER PROGRAMMING is NP-complete. 

Proof To show that 0-1 INTEGER PROGRAMMING is in NP, we note that a 0-1 vector x 
can be chosen nondeterministically, one component per step, after which verification that it 
is a solution to the problem can be done in a polynomial number of steps on a RAM and on a DTM. 

To show that 0-1 INTEGER PROGRAMMING is NP-hard we give a log-space reduction 
of 3-SAT to it. Given an instance of 3-SAT, namely, a set of literals 
$X = \{x_1, \bar{x}_1, x_2, \bar{x}_2, \ldots, x_n, \bar{x}_n\}$ and a sequence of clauses $C = (c_1, c_2, \ldots, c_m)$, where each 
clause $c_i$ is a subset of X containing at most three literals, we construct an $m \times p$ matrix 
$A = [B \mid C]$, where $B = [b_{i,j}]$ for $1 \le i \le m$, $1 \le j \le n$ and $C = [c_{r,s}]$ for $1 \le r \le m$ 
and $1 \le s \le mn$. We also construct a column m-vector d as shown below, where $p = (m+1)n$. 
The entries of B and C are defined below. 

$$b_{i,j} = \begin{cases} 1 & \text{if } x_j \in c_i \text{ for } 1 \le j \le n \\ -1 & \text{if } \bar{x}_j \in c_i \text{ for } 1 \le j \le n \\ 0 & \text{otherwise} \end{cases}$$

$$c_{r,s} = \begin{cases} -1 & \text{if } (r-1)n + 1 \le s \le rn \\ 0 & \text{otherwise} \end{cases}$$

Since no one clause contains both $x_j$ and $\bar{x}_j$, this definition of $b_{i,j}$ is consistent. 

We also let $d_i$, the ith component of d, satisfy $d_i = 1 - q_i$, where $q_i$ is the number of 
complemented variables in $c_i$. Thus, the matrix A has the form given below, where B is an 
$m \times n$ matrix and each row of A contains n instances of $-1$ outside of B in non-overlapping 
columns: 

$$A = \left[ \begin{array}{c|cccc}
  & -1 \cdots -1 & 0 \cdots 0 & \cdots & 0 \cdots 0 \\
B & 0 \cdots 0 & -1 \cdots -1 & \cdots & 0 \cdots 0 \\
  & \vdots & \vdots & \ddots & \vdots \\
  & 0 \cdots 0 & 0 \cdots 0 & \cdots & -1 \cdots -1
\end{array} \right]$$

We show that the instance of 3-SAT is a "Yes" instance if and only if this instance of 0-1 
INTEGER PROGRAMMING is a "Yes" instance, that is, if and only if Ax = d. 

We write the column p-vector x as the concatenation of the column n-vector u and 
the column mn-vector v. It follows that $Ax = d$ (for a suitable 0-1 vector v) if and only if 
$Bu \ge d$. Now consider the ith component of $Bu$. Let u select $k_i$ uncomplemented and $l_i$ 
complemented variables of clause $c_i$. Then, $Bu \ge d$ if and only if $k_i - l_i \ge d_i = 1 - q_i$ or 
$k_i + (q_i - l_i) \ge 1$ for all i. Now let $x_i = u_i$ for $1 \le i \le n$. Then $k_i$ and $q_i - l_i$ are the 
numbers of uncomplemented and complemented variables in $c_i$ that are set to 1 and 0, respectively. 
Since $k_i + (q_i - l_i) \ge 1$, $c_i$ is satisfied, as are all clauses, giving us the desired result. ■ 
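The matrix construction can be checked on small formulas. The sketch below builds $A = [B \mid C]$ and d as nested Python lists, with literals encoded as signed integers (`+j` for x_j, `-j` for its complement); the encoding, the function names, and the exhaustive search over all $2^p$ vectors are assumptions for illustration. The n columns of $-1$ in each row act as slack variables that turn the inequality for that clause into an equation.

```python
from itertools import product

def sat_to_zero_one_ip(clauses, n):
    """Return (A, d) with A = [B | C] of size m x (m + 1) n."""
    m = len(clauses)
    p = (m + 1) * n
    A = [[0] * p for _ in range(m)]
    d = []
    for i, clause in enumerate(clauses):
        for lit in clause:
            A[i][abs(lit) - 1] = 1 if lit > 0 else -1      # block B
        for s in range(i * n, (i + 1) * n):
            A[i][n + s] = -1                                # block C
        d.append(1 - sum(1 for lit in clause if lit < 0))   # d_i = 1 - q_i
    return A, d

def has_zero_one_solution(A, d):
    """Try every 0-1 vector x and test whether Ax = d."""
    p = len(A[0])
    return any(
        all(sum(a * x for a, x in zip(row, xs)) == rhs
            for row, rhs in zip(A, d))
        for xs in product((0, 1), repeat=p)
    )

satisfiable = [(1, -2), (-1, 2)]                       # true when x1 = x2
unsatisfiable = [(1, 2), (1, -2), (-1, 2), (-1, -2)]   # excludes all four assignments
```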

8.11 The Boundary Between P and NP 

It is important to understand where the boundary lies between problems in P and the NP- 
complete problems. While this topic is wide open, we shed a modest amount of light on it by 
showing that 2-SAT, the version of 3-SAT in which each clause has at most two literals, lies on 
the P-side of this boundary, as shown below. In fact, it is in NL, which is in P. 

THEOREM 8.11.1 2-SAT is in NL. 

Proof Given an instance I of 2-SAT, we first ensure that each clause has exactly two distinct 
literals by adding to each one-literal clause a new literal z that is not used elsewhere. We 
then construct a directed graph $G = (V, E)$ with vertices V labeled by the literals $x$ and $\bar{x}$ 
for each variable x appearing in I. This graph has an edge $(\alpha, \beta)$ in E directed from vertex 
$\alpha$ to vertex $\beta$ if the clause $(\bar{\alpha} \lor \beta)$ is in I. If $(\bar{\alpha} \lor \beta)$ is in I, so is $(\beta \lor \bar{\alpha})$ because of 
commutativity of $\lor$. Thus, if $(\alpha, \beta) \in E$, then $(\bar{\beta}, \bar{\alpha}) \in E$ also. (See Fig. 8.15.) Note 
that $(\alpha, \beta) \neq (\bar{\beta}, \bar{\alpha})$ because this requires that $\beta = \bar{\alpha}$, which is not allowed. Let $\alpha \neq \gamma$. 
It follows that if there is a path from $\alpha$ to $\gamma$ in G, there is a distinct path from $\bar{\gamma}$ to $\bar{\alpha}$ 
obtained by reversing the directions of each edge on the path and replacing the literals by 
their complements. 

To understand why these edges are chosen, note that if all clauses of I are satisfied and 
$(\bar{\alpha} \lor \beta)$ is in I, then $\alpha = 1$ implies that $\beta = 1$. This implication relation, denoted $\alpha \Rightarrow \beta$, 
is transitive. If there is a path $(\alpha_1, \alpha_2, \ldots, \alpha_k)$ in G, then there are clauses $(\bar{\alpha}_1 \lor \alpha_2)$, 
$(\bar{\alpha}_2 \lor \alpha_3)$, ..., $(\bar{\alpha}_{k-1} \lor \alpha_k)$ in I. If all clauses are satisfied and if the literal $\alpha_1 = 1$, then 
each un-negated literal on this path must have value 1. 

We now show that an instance I is a "No" instance if and only if there is a variable x 
such that there is a path in G from $x$ to $\bar{x}$ and one from $\bar{x}$ to $x$. 

If there is a variable x such that such paths exist, this means that $x \Rightarrow \bar{x}$ and $\bar{x} \Rightarrow x$, 
which is a logical contradiction. This implies that the instance I is a "No" instance. 

Conversely, suppose I is a "No" instance. To prove there is a variable x such that there 
are paths from vertex $x$ to vertex $\bar{x}$ and from $\bar{x}$ to $x$, assume that for no variable x does this 

Figure 8.15 A graph capturing the implications associated with the following satisfiable instance 
of 2-SAT: (x 3 V x%) A (x 3 V X\) A (x 3 V x 2 ) A (x x V x 2 ) A (x } V x{). 



condition hold and show that I is a "Yes" instance, that is, every clause is satisfied, which 
contradicts the assumption that I is a "No" instance. 

Identify a variable that has not been assigned a value and let $\alpha$ be one of the two 
corresponding literals such that there is no directed path in G from the vertex $\alpha$ to $\bar{\alpha}$. (By 
assumption, this must hold for at least one of the two literals associated with x.) Assign 
value 1 to $\alpha$ and each literal $\lambda$ reachable from it. (This assigns values to the variables 
identified by these literals.) If these assignments can be made without assigning a variable both 
values 0 and 1, each clause can be satisfied and I is a "Yes" instance rather than a "No" one, as 
assumed. To show that each variable is assigned a single value, we assume the converse and 
show that the conditions under which values are assigned to variables by this procedure are 
contradicted. A variable can be assigned contradictory values in two ways: a) on the current 
step the literals $\lambda$ and $\bar{\lambda}$ are both reachable from $\alpha$ and assigned value 1, and b) a literal $\lambda$ 
is reachable from $\alpha$ on the current step that was assigned value 0 on a previous step. For 
the first case to happen, there must be a path from $\alpha$ to vertices $\lambda$ and $\bar{\lambda}$. By design of the 
graph, if there is a path from $\alpha$ to $\lambda$, there is a path from $\bar{\lambda}$ to $\bar{\alpha}$. Since there is a path from 
$\alpha$ to $\bar{\lambda}$, there must be a path from $\alpha$ to $\bar{\alpha}$, contradicting the assumption that there are no 
such paths. In the second case, let a literal $\lambda$ be assigned 1 on the current step that was assigned 0 
on a previous step. It follows that $\bar{\lambda}$ was given value 1 on that step. Because there is a path 
from $\alpha$ to $\lambda$, there is one from $\bar{\lambda}$ to $\bar{\alpha}$ and our procedure, which assigned $\bar{\lambda}$ value 1 on the 
earlier step, must have assigned $\bar{\alpha}$ value 1 on that step also. Thus, $\alpha$ had the value 0 before 
the current step, contradicting the assumption that it was not assigned a value. 

To show that 2-SAT is in NL, recall that NL is closed under complements. Thus, it suffices 
to show that "No" instances of 2-SAT can be accepted in nondeterministic logarithmic 
space. By the above argument, if I is a "No" instance, there is a variable x such that there is 
a path in G from $x$ to $\bar{x}$ and from $\bar{x}$ to $x$. Since the number of vertices in G is at most linear 
in n, the length of I (it may be as small as $O(\sqrt{n})$), an NDTM can propose and then verify 
in space $O(\log n)$ a path in G from $x$ to $\bar{x}$ and back by checking that the putative edges are 
edges of G, that $x$ is the first and last vertex on the path, and that $\bar{x}$ is encountered before 
the end of the path. ■ 
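The criterion established in this proof translates directly into a deterministic test. The sketch below builds the implication graph and uses depth-first search for reachability, so it illustrates the criterion itself rather than the nondeterministic log-space procedure; literals are encoded as signed integers (`+i` for x_i, `-i` for its complement), and that encoding and the function name are assumptions.

```python
def two_sat_is_satisfiable(clauses, n):
    """Decide 2-SAT for clauses given as pairs of signed-integer literals."""
    graph = {lit: set() for i in range(1, n + 1) for lit in (i, -i)}
    for a, b in clauses:
        graph[-a].add(b)   # (a or b): if a is 0 then b must be 1
        graph[-b].add(a)   # and if b is 0 then a must be 1

    def reachable(source, goal):
        """Depth-first search for a directed path from source to goal."""
        seen, stack = {source}, [source]
        while stack:
            u = stack.pop()
            if u == goal:
                return True
            for w in graph[u] - seen:
                seen.add(w)
                stack.append(w)
        return False

    # "No" instance exactly when some x reaches its complement and back.
    return not any(reachable(x, -x) and reachable(-x, x)
                   for x in range(1, n + 1))

yes_clauses = [(1, 2), (-1, 2), (1, -2)]            # satisfied by x1 = x2 = 1
no_clauses = [(1, 2), (-1, 2), (1, -2), (-1, -2)]   # excludes every assignment
```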

8.12 PSPACE-Complete Problems 

PSPACE is the class of decision problems that are decidable by a Turing machine in space poly- 
nomial in the length of the input. Problems in PSPACE are potentially much more complex 
than problems in P. 

The hardest problems in PSPACE are the PSPACE-complete problems. (See Section 8.8.) 
Such problems have two properties: a) they are in PSPACE and b) every problem in PSPACE 
can be reduced to them by a polynomial-time Turing machine. The PSPACE-complete prob- 
lems are the hardest problems in PSPACE in the sense that if they are in P, then so are all 
problems in PSPACE, an unlikely prospect. 

We now establish that QUANTIFIED SATISFIABILITY defined below is PSPACE-complete. 
We also show that GENERALIZED GEOGRAPHY, a game played on a graph, is PSPACE- 
complete by reducing QUANTIFIED SATISFIABILITY to it. A characteristic shared by many 
important PSPACE-complete problems and these two problems is that they are equivalent to 
games on graphs. 

8.12.1 A First PSPACE-Complete Problem 

Quantified Boolean formulas use existential quantification, denoted $\exists$, and universal 
quantification, denoted $\forall$. Existential quantification on variable $x_1$, denoted $\exists x_1$, means "there 
exists a value for the Boolean variable $x_1$," whereas universal quantification on variable $x_2$, 
denoted $\forall x_2$, means "for all values of the Boolean variable $x_2$." Given a Boolean formula such 
as (x\ V Xj V 5%) A (x\ V i2 V 13) A (x~i V x 2 V X3), a quantification of it is a collection of 
universal or existential quantifiers, one per variable in the formula, followed by the formula. 
For example, 

$\forall x_1 \exists x_2 \forall x_3 [\phi]$, where $\phi$ denotes the Boolean formula given above, 

is a quantified formula. Its meaning is "for all values of $x_1$, does there exist a value for $x_2$ such 
that for all values of $x_3$ the formula above is satisfied?" 
In this case the answer is "No" because for $x_1 = 1$, the function is not satisfied with $x_3 = 0$ 
when $x_2 = 0$ and is not satisfied with $x_3 = 1$ when $x_2 = 1$. However, if the third quantifier 
is changed from universal to existential, then the quantified formula is satisfied. Note that the 
order of the quantifiers is important. To see this, observe that under the quantification order 
$\forall x_1 \forall x_3 \exists x_2$ the quantified formula is satisfied. 

QUANTIFIED SATISFIABILITY consists of satisfiable instances of quantified Boolean for- 
mulas in which each formula is expressed as a set of clauses. 

QUANTIFIED SATISFIABILITY 

Instance: A set of literals $X = \{x_1, \bar{x}_1, x_2, \bar{x}_2, \ldots, x_n, \bar{x}_n\}$, a sequence of clauses 
$C = (c_1, c_2, \ldots, c_m)$, where each clause $c_i$ is a subset of X, and a sequence of quantifiers 
$(Q_1, Q_2, \ldots, Q_n)$, where $Q_j \in \{\forall, \exists\}$. 

Answer: "Yes" if under the quantifiers $Q_1 x_1 Q_2 x_2 \cdots Q_n x_n$, the clauses $c_1, c_2, \ldots, c_m$ are 
satisfied, denoted 

$$Q_1 x_1 Q_2 x_2 \cdots Q_n x_n [\phi]$$

where the formula $\phi = c_1 \land c_2 \land \cdots \land c_m$ is in the product-of-sums form. (See Section 2.2.) 

In this section we establish the following result, stronger than PSPACE-completeness of 
QUANTIFIED SATISFIABILITY: we show it is complete for PSPACE under log-space trans- 
formations. Reductions of this type are potentially stronger than polynomial-time reductions 
because the transformation is executed in logarithmic space, not polynomial time. While it 
is true that every log-space transformation is a polynomial-time transformation (see Theo- 
rem 8.5.8), it is not known if the reverse is true. We prove this result in two stages: we first 
show that QUANTIFIED SATISFIABILITY is in PSPACE and then that it is hard for PSPACE. 

LEMMA 8.12.1 QUANTIFIED SATISFIABILITY is in PSPACE. 

Proof To show that QUANTIFIED SATISFIABILITY is in PSPACE we evaluate in polynomial 
space a circuit, $C_{qsat}$, whose value is 1 if and only if the instance of QUANTIFIED 
SATISFIABILITY is a "Yes" instance. The circuit $C_{qsat}$ is a tree all of whose paths from the 
inputs to the output (root of the tree) have the same length, each vertex is either an AND 
gate or an OR gate, and each input has value 0 or 1. (See Fig. 8.16.) The gate at the root of 
the tree is associated with the variable $x_1$, the gates at the next level are associated with $x_2$, 
etc. The type of gate at the jth level is determined by the jth quantifier $Q_j$ and is AND if 
$Q_j = \forall$ and OR if $Q_j = \exists$. The leaves correspond to all $2^n$ values of the n variables: 
at each level of the tree the left and right branches correspond to the values 0 and 1 for the 
corresponding quantified variable. Each leaf of the tree contains the value of the formula $\phi$ 
for the values of the variables leading to that leaf. In the example of Fig. 8.16 the leftmost 
leaf has value 1 because on input $x_1 = x_2 = x_3 = 0$ each of the three clauses of $\phi$ is 
satisfied. 

It is straightforward to see that the value at the root of the tree is 1 if all clauses are 
satisfied under the quantifiers $Q_1 x_1 Q_2 x_2 \cdots Q_n x_n$ and 0 otherwise. Thus, the circuit solves 
the QUANTIFIED SATISFIABILITY problem and its complement. (Note that PSPACE = 
coPSPACE, as shown in Theorem 8.6.1.) 







Figure 8.16 A tree circuit constructed from the instance VrEiBa^Va^ for (j) = (x\ ViiV 
X 3 ) A (xi Vi 2 V £3) A (xi Vi 2 V 23) of QUANTIFIED SATISFIABILITY. The eight values at 
the leaves of the tree are the values of $\phi$ on the eight different assignments to $(x_1, x_2, x_3)$. 

tree_eval(n, φ, Q, d, w); 
   if d = n then 
      return(evaluate(φ, w)); 
   else 
      if first(Q) = ∃ then 
         return(tree_eval(n, φ, rest(Q), d+1, w0) 
            OR tree_eval(n, φ, rest(Q), d+1, w1)); 
      else 
         return(tree_eval(n, φ, rest(Q), d+1, w0) 
            AND tree_eval(n, φ, rest(Q), d+1, w1)); 

Figure 8.17 A program for the recursive procedure tree_eval(n, φ, Q, d, w). The tuple w 
keeps track of the path taken into the tree. 



The circuit C qsa t has size exponential in n because there are 2™ values for the n variables. 
However, it can be evaluated in polynomial space, as we show. For this purpose consider the 
recursive procedure tree_eval (n, (ft, Q, d, w) in Fig. 8.17 that evaluates C qsat . Here n is 
the number of variables in the quantization, d is the depth of recursion, (ft is the expression 
over which quantification is done, Q is a sequence of quantifiers, and w holds the values for 
d variables. Also, first (Q) and rest (Q) are the first and all but the first components of 
Q, respectively. When d = 0, Q = (Qi, Qi, ■ ■ ■ , Q n ) and Q\X\Q2X2 • • • Q n %n <ft is the 
expression to evaluate. We show that tree_eval (n, (ft, Q, 0, e) can be computed in space 
quadratic in the length of an instance of QUANTIFIED SATISFIABILITY. 

When $d = n$, the procedure has reached a leaf of the tree and the string $w$ contains
values for the variables $x_1, x_2, \ldots, x_n$, in that order. Since all variables of $\phi$ are known when
$d = n$, $\phi$ can be evaluated. Let evaluate$(\phi, w)$ be the function that evaluates $\phi$ with values
specified by $w$. Clearly tree_eval$(n, \phi, Q, 0, \epsilon)$ is the value of $Q_1 x_1 Q_2 x_2 \cdots Q_n x_n\, \phi$.

We now determine the work space needed to compute tree_eval$(n, \phi, Q, d, w)$ on
a DTM. (The discussion in the proof of Theorem 8.5.5 is relevant.) Evaluation of this
procedure amounts to a depth-first traversal of the tree. An activation record is created for
each call to the procedure and is pushed onto a stack. Since the depth of the tree is $n$, at most
$n + 1$ records will be on the stack. Since each activation record contains a string of length
$O(n)$, the total space used is $O(n^2)$. Since the length of $Q_1 x_1 Q_2 x_2 \cdots Q_n x_n\, \phi$ is at
least $n$, the space is polynomial in the length of this formula. ■
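
The procedure of Fig. 8.17 can be sketched in Python as follows. This is an illustrative sketch, not the book's program: the clause encoding (signed integers, with +i for $x_i$ and -i for its complement) and the quantifier strings over "E" and "A" are assumptions made here for concreteness.

```python
def evaluate(clauses, w):
    """Return True if the 0/1 assignment w satisfies every clause.

    Assumed encoding: a clause is a list of nonzero integers,
    +i standing for x_i and -i for the complement of x_i.
    """
    return all(
        any((w[abs(lit) - 1] == 1) == (lit > 0) for lit in clause)
        for clause in clauses
    )


def tree_eval(n, clauses, quants, d=0, w=()):
    """Depth-first evaluation of Q_1 x_1 ... Q_n x_n phi.

    quants holds the quantifiers still to be processed ("E" or "A");
    w holds the values already chosen for x_1, ..., x_d.
    """
    if d == n:
        return evaluate(clauses, w)          # a leaf of the tree circuit
    first, rest = quants[0], quants[1:]
    left = tree_eval(n, clauses, rest, d + 1, w + (0,))
    if first == "E":                          # OR gate
        return left or tree_eval(n, clauses, rest, d + 1, w + (1,))
    return left and tree_eval(n, clauses, rest, d + 1, w + (1,))  # AND gate
```

At most $n + 1$ calls are active at once and each holds a tuple of length at most $n$, which mirrors the $O(n^2)$ space bound of the proof, even though the running time is exponential in $n$.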

LEMMA 8.12.2 QUANTIFIED SATISFIABILITY is log-space hard for PSPACE.



Proof Our goal is to show that every decision problem $\mathcal{P} \in$ PSPACE can be reduced in
log-space to an instance of QUANTIFIED SATISFIABILITY. Instead, we show that every such
$\mathcal{P}$ can be reduced in log-space to a "No" instance of QUANTIFIED SATISFIABILITY (we call
this QUANTIFIED UNSATISFIABILITY). A "No" instance is one for which the formula
$\phi$, which is in product-of-sums form, is not satisfied under the specified quantification or,
equivalently, one for which its Boolean complement, which is in sum-of-products expansion
(SOPE) form, is satisfied under a quantification in which $\forall$ is replaced by $\exists$ and vice
versa. Exchanging "Yes" and "No" instances of decision problems (which we can do since
PSPACE is closed under complements), we have that every problem in coPSPACE can be
reduced in log-space to QUANTIFIED SATISFIABILITY. However, since PSPACE = coPSPACE,
we have the desired result.

Our task now is to show that every problem $\mathcal{P} \in$ PSPACE can be reduced in log-space
to an instance of QUANTIFIED UNSATISFIABILITY. Let $L \in$ PSPACE be the language
of "Yes" instances of $\mathcal{P}$ and let $M$ be the DTM deciding $L$. Instances of QUANTIFIED
UNSATISFIABILITY will be quantified formulas in SOPE form that describe conditions on
the configuration graph $G(M, w)$ of $M$ on input $w$. We associate a Boolean vector with
each vertex in $G(M, w)$ and assume that $G(M, w)$ has one initial and one final vertex associated
with the vectors $\mathbf{a}$ and $\mathbf{b}$, respectively. (We can make the last assumption because $M$ can be
designed to enter a cleanup phase in which it prints blanks in all non-blank tape cells.)

Let $\mathbf{c}$ and $\mathbf{d}$ be vector encodings of arbitrary configurations $c$ and $d$ of $G(M, w)$. We
construct formulas $\psi_i(\mathbf{c}, \mathbf{d})$, $0 \leq i \leq k$, in SOPE form that are satisfied if and only if
there exists a path from $c$ to $d$ in $G(M, w)$ of length at most $2^i$ (it computes the predicate
PATH$(c, d, 2^i)$ introduced in the proof of Theorem 8.5.5). Then a "Yes" instance of
QUANTIFIED UNSATISFIABILITY is the formula $\psi_k(\mathbf{a}, \mathbf{b})$, where $\mathbf{a}$ and $\mathbf{b}$ are encodings
of the initial and final vertices of $G(M, w)$ for $k$ sufficiently large that a polynomial-space
computation can be done in time $2^k$. Since, as seen in Theorem 8.5.6, a deterministic
computation in space $S$ is done in time $O(2^S)$, it suffices for $k$ to be polynomial in the length
of the input.

The formula $\psi_0(\mathbf{c}, \mathbf{d})$ is satisfiable if either $c = d$ or $d$ follows from $c$ in one step. Such
formulas are easily computed from the descriptions of $M$ and $w$. $\psi_i(\mathbf{c}, \mathbf{d})$ can be expressed
as shown below, where the existential quantification is over all possible intermediate
configurations $\mathbf{e}$ of $M$. (See the proof of Theorem 8.5.5 for the representation of PATH$(c, d, 2^i)$
in terms of PATH$(c, e, 2^{i-1})$ and PATH$(e, d, 2^{i-1})$.)

$\psi_i(\mathbf{c}, \mathbf{d}) = \exists \mathbf{e}\, [\psi_{i-1}(\mathbf{c}, \mathbf{e}) \wedge \psi_{i-1}(\mathbf{e}, \mathbf{d})]$ (8.1)

Note that $\exists \mathbf{e}$ is equivalent to $\exists e_1 \exists e_2 \cdots \exists e_q$, where $q$ is the length of $\mathbf{e}$. Universal
quantification over a vector is expanded in a similar fashion.

Unfortunately, for $i = k$ this recursively defined formula requires space exponential
in the size of the input. Fortunately, we can represent $\psi_i(\mathbf{c}, \mathbf{d})$ more succinctly using the
implication operator $x \Rightarrow y$, as shown below, where $x \Rightarrow y$ is equivalent to $\bar{x} \vee y$. Note
that if $x \Rightarrow y$ is TRUE, then either $x$ is FALSE or $x$ and $y$ are both TRUE.

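$\psi_i(\mathbf{c}, \mathbf{d}) = \exists \mathbf{e}\, [\forall \mathbf{x}\, \forall \mathbf{y}\, [(\mathbf{x} = \mathbf{c} \wedge \mathbf{y} = \mathbf{e}) \vee (\mathbf{x} = \mathbf{e} \wedge \mathbf{y} = \mathbf{d})] \Rightarrow \psi_{i-1}(\mathbf{x}, \mathbf{y})]$ (8.2)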

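Here $\mathbf{x} = \mathbf{y}$ denotes $(x_1 = y_1) \wedge (x_2 = y_2) \wedge \cdots \wedge (x_q = y_q)$, where $(x_i = y_i)$ denotes
$x_i y_i \vee \bar{x}_i \bar{y}_i$. Then, the formula in the outer square brackets of (8.2) is true when either
$(\mathbf{x} = \mathbf{c} \wedge \mathbf{y} = \mathbf{e}) \vee (\mathbf{x} = \mathbf{e} \wedge \mathbf{y} = \mathbf{d})$ is FALSE or this expression is TRUE and $\psi_{i-1}(\mathbf{x}, \mathbf{y})$ is
also TRUE. Because the contents of the outer square brackets must be TRUE for all $\mathbf{x}$ and $\mathbf{y}$,
the quantification on $\mathbf{x}$ and $\mathbf{y}$ requires that $\psi_{i-1}(\mathbf{c}, \mathbf{e})$ and $\psi_{i-1}(\mathbf{e}, \mathbf{d})$ both be TRUE,
that is, that the formula given in (8.1) be satisfied.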
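
The equivalence of (8.1) and (8.2) can be checked by brute force on a small graph. The sketch below is only an illustration under assumptions made here: configurations are represented as small integers rather than Boolean vectors, and `step` is a hypothetical set of one-step transitions. Note that the code re-evaluates $\psi_{i-1}$ for each pair, whereas the point of (8.2) is that the written formula contains only one copy of $\psi_{i-1}$.

```python
def psi_81(step, configs, i, c, d):
    """psi_i(c, d) following (8.1): exists e with psi_{i-1}(c, e) and psi_{i-1}(e, d)."""
    if i == 0:
        return c == d or (c, d) in step
    return any(
        psi_81(step, configs, i - 1, c, e) and psi_81(step, configs, i - 1, e, d)
        for e in configs
    )


def psi_82(step, configs, i, c, d):
    """psi_i(c, d) following (8.2): exists e such that for all x, y,
    ((x = c and y = e) or (x = e and y = d)) implies psi_{i-1}(x, y)."""
    if i == 0:
        return c == d or (c, d) in step
    return any(
        all(
            # g => h is written as (not g) or h
            (not ((x == c and y == e) or (x == e and y == d)))
            or psi_82(step, configs, i - 1, x, y)
            for x in configs
            for y in configs
        )
        for e in configs
    )
```

For a chain of five configurations, for example, both functions report a path from 0 to 4 of length at most $2^2$ but none of length at most $2^1$.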

It remains to convert the expression for $\psi_i(\mathbf{c}, \mathbf{d})$ given above to SOPE form in log-space.
But this is straightforward. We replace $g \Rightarrow h$ by $\bar{g} \vee h$, where $g = (r \wedge s) \vee (t \wedge u)$ and
$r = (\mathbf{x} = \mathbf{c})$, $s = (\mathbf{y} = \mathbf{e})$, $t = (\mathbf{x} = \mathbf{e})$, and $u = (\mathbf{y} = \mathbf{d})$. It follows that

$\bar{g} = (\bar{r} \vee \bar{s}) \wedge (\bar{t} \vee \bar{u})$
$\phantom{\bar{g}} = (\bar{r} \wedge \bar{t}) \vee (\bar{r} \wedge \bar{u}) \vee (\bar{s} \wedge \bar{t}) \vee (\bar{s} \wedge \bar{u})$

Since each of $r$, $s$, $t$, and $u$ can be expressed as a conjunction of $q$ terms of the form
$(x_j = y_j)$ and $\overline{(x_j = y_j)} = (x_j \bar{y}_j \vee \bar{x}_j y_j)$, $1 \leq j \leq q$, it follows that $\bar{r}$, $\bar{s}$, $\bar{t}$, and $\bar{u}$
can each be expressed as a disjunction of $2q$ terms. Each of the four terms of the form
$(\bar{r} \wedge \bar{t})$ consists of $4q^2$ terms, each of which is a conjunction of four literals. Thus, $\bar{g}$ is the
disjunction of $16q^2$ terms of four literals each.
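
The term count can be confirmed mechanically. The following sketch enumerates the product terms of $\bar{g}$ and compares the resulting sum-of-products form with the complement of $g$ on every assignment. The literal representation (vector name, bit index, required value) is an assumption chosen here for illustration.

```python
from itertools import product


def neq_terms(a, b, q):
    """Terms whose OR is the complement of (a = b) for q-bit vectors a, b.

    Each term is a list of literals (vector_name, index, required_value).
    """
    terms = []
    for j in range(q):
        terms.append([(a, j, 1), (b, j, 0)])   # a_j AND NOT b_j
        terms.append([(a, j, 0), (b, j, 1)])   # NOT a_j AND b_j
    return terms


def g_bar_terms(q):
    """Product terms of NOT g, where g = (r AND s) OR (t AND u)."""
    r, s = neq_terms("x", "c", q), neq_terms("y", "e", q)
    t, u = neq_terms("x", "e", q), neq_terms("y", "d", q)
    terms = []
    for left, right in ((r, t), (r, u), (s, t), (s, u)):
        terms += [p1 + p2 for p1 in left for p2 in right]
    return terms


def g(assign, q):
    eq = lambda a, b: all(assign[a][j] == assign[b][j] for j in range(q))
    return (eq("x", "c") and eq("y", "e")) or (eq("x", "e") and eq("y", "d"))


def check(q):
    """True if the SOPE form agrees with NOT g on all assignments."""
    terms = g_bar_terms(q)
    names = ["x", "y", "c", "e", "d"]
    for bits in product((0, 1), repeat=len(names) * q):
        assign = {nm: bits[k * q:(k + 1) * q] for k, nm in enumerate(names)}
        sope = any(all(assign[v][j] == val for v, j, val in term) for term in terms)
        if sope != (not g(assign, q)):
            return False
    return True
```

Some of the generated terms contain a repeated or contradictory literal; they are kept so that the count matches the $16q^2$ of the text.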

Given the regular structure of this formula for $\psi_i$, it can be generated from a formula for
$\psi_{i-1}$ in space $O(\log q)$. Since $0 \leq i \leq k$ and $k$ is polynomial in the length of the input, all
the formulas, including that for $\psi_k$, can be generated in log-space. By the above reasoning,
this formula is a "Yes" instance of QUANTIFIED UNSATISFIABILITY if and only if there is a
path in the configuration graph $G(M, w)$ between the initial and final states. ■

Combining the two results, we have the following theorem. 

THEOREM 8.12.1 QUANTIFIED SATISFIABILITY is log-space complete for PSPACE.

8.12.2 Other PSPACE-Complete Problems

An important version of QUANTIFIED SATISFIABILITY is ALTERNATING QUANTIFIED SAT- 
ISFIABILITY. 

ALTERNATING QUANTIFIED SATISFIABILITY 

Instance: Instances of QUANTIFIED SATISFIABILITY that have an even number of quantifiers
that alternate between $\exists$ and $\forall$, with $\exists$ the first quantifier.
Answer: "Yes" if the instance is a "Yes" instance of QUANTIFIED SATISFIABILITY.

THEOREM 8.12.2 ALTERNATING QUANTIFIED SATISFIABILITY is log-space complete for 
PSPACE. 

Proof ALTERNATING QUANTIFIED SATISFIABILITY is in PSPACE because it is a special
case of QUANTIFIED SATISFIABILITY. We reduce QUANTIFIED SATISFIABILITY to
ALTERNATING QUANTIFIED SATISFIABILITY in log-space as follows. If two universal
quantifiers appear in succession, we add an existential quantifier between them in a new variable,
say $x_l$, and add the new clause $\{x_l, \bar{x}_l\}$ at the end of the formula $\phi$. If two existential
quantifiers appear in succession, we add universal quantification over a new variable and a clause
containing it and its negation. If the first quantifier is universal or the number of quantifiers
is not even, we add a quantifier over a new variable at the beginning or end in the same way.
This transformation at most doubles the number of variables and
clauses and can be done in log-space. Because each new clause is always satisfied, the instance
of ALTERNATING QUANTIFIED SATISFIABILITY is a "Yes" instance if and only if the instance
of QUANTIFIED SATISFIABILITY is a "Yes" instance. ■
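
A minimal sketch of this padding transformation follows. The representation is an assumption made here: a prefix is a non-empty list of (quantifier, variable) pairs with "E" for existential and "A" for universal, and clauses are lists of signed integers. The brute-force evaluator `qbf` is included only to check that the answer is preserved on small instances.

```python
def alternate(prefix, clauses):
    """Pad a quantifier prefix so that it alternates E, A, E, A, ... with even length.

    Each dummy variable v contributes the always-satisfied clause [v, -v].
    """
    next_var = 1 + max(v for _, v in prefix)   # first unused variable index
    out, extra = [], []
    for q, v in prefix:
        expected = "E" if len(out) % 2 == 0 else "A"
        if q != expected:                       # insert a dummy of the missing kind
            out.append((expected, next_var))
            extra.append([next_var, -next_var])
            next_var += 1
        out.append((q, v))
    if len(out) % 2 == 1:                       # last quantifier is E: close with A
        out.append(("A", next_var))
        extra.append([next_var, -next_var])
    return out, clauses + extra


def qbf(prefix, clauses, assign=None):
    """Brute-force truth value of a quantified CNF formula (exponential time)."""
    assign = dict(assign or {})
    if not prefix:
        return all(any(assign[abs(l)] == (l > 0) for l in c) for c in clauses)
    (q, v), rest = prefix[0], prefix[1:]
    results = (qbf(rest, clauses, {**assign, v: b}) for b in (False, True))
    return any(results) if q == "E" else all(results)
```

For example, the prefix A, A, E on three variables becomes E, A, E, A, E, A on six, and the truth value of the instance is unchanged.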

The new version of QUANTIFIED SATISFIABILITY is akin to a game in which universal 
and existential players alternate. The universal player attempts to show a fact for all values of 
its Boolean variable, whereas the existential player attempts to deny that fact by the choice of 
its existential variable. It is not surprising, therefore, that many games are PSPACE-complete. 
The geography game described below is of this type. 

The geography game is a game for two players. They alternate choosing names of cities
in which the first letter of the next city is the last letter of the previous city until one of the two
players (the losing player) cannot find a name that has not already been used. (See Fig. 8.18.)
This game is modeled by a graph in which each vertex carries the name of a city and there is
an edge from vertex $u_1$ to vertex $u_2$ if the last letter in the name associated with $u_1$ is the first
letter in the name associated with $u_2$. In general this graph is directed because an edge from
$u_1$ to $u_2$ does not guarantee an edge from $u_2$ to $u_1$.

GENERALIZED GEOGRAPHY 

Instance: A directed graph $G = (V, E)$ and a vertex $v$.
Answer: "Yes" if there is a sequence of (at most $|V|$) alternating vertex selections by two
players such that vertex $v$ is the first selection by the first player and for each selection of
the first player and all selections of the second player of vertices adjacent to the previous
selection, the second player arrives at a vertex from which it cannot select a vertex not
previously selected.

THEOREM 8.12.3 GENERALIZED GEOGRAPHY is log-space complete for PSPACE.

Proof To show that GENERALIZED GEOGRAPHY is log-space complete for PSPACE, we 
show that it is in PSPACE and that QUANTIFIED SATISFIABILITY can be reduced to it 
in log-space. To establish the first result, we show that the outcome of GENERALIZED 
GEOGRAPHY can be determined by evaluating a graph similar to the binary tree used to 
show that QUANTIFIED SATISFIABILITY is realizable in PSPACE. 

Given the graph $G = (V, E)$ (see Fig. 8.18(a)), we construct a search graph (see
Fig. 8.18(b)) by performing a variant of depth-first search of $G$ from $v$. At each vertex
we visit the next unvisited descendant, continuing until we encounter a vertex on the
current path, at which point we backtrack and try the next sibling of the current vertex, if any.
In depth-first search if a vertex has been visited previously, it is not visited again. In this
variant of the algorithm, however, a vertex is revisited if it is not on the current path. The
length of the longest path in this tree is at most $|V| - 1$ because each path can contain no
more than $|V|$ vertices. The tree may have a number of vertices exponential in $|V|$.

At a leaf vertex a player has no further moves. The first player wins if it is the second
player's turn at a leaf vertex and loses otherwise. Thus, a leaf vertex is labeled 1 (0) if the
first player wins (loses). To ensure that the value at a vertex $u$ is 1 if the two players reach $u$
and the first player wins, we assign OR operators to vertices at which the first player makes
selections and AND operators otherwise. (The output of a one-input AND or OR gate is the
value of its input.) This provides a circuit that can be evaluated just as was the circuit $C_{\mathrm{QSAT}}$
used in the proof of Theorem 8.12.1. The "Yes" instances of GENERALIZED GEOGRAPHY
are such that the first player can win by choosing a first city. In Fig. 8.18 the value of the
root vertex is 0, which means that the first player loses by choosing to start with Marblehead
as the first city.

Figure 8.18 (a) A graph for the generalized geography game and (b) the search tree associated
with the game in which the start vertex is Marblehead.

Vertices labeled AND or OR in the tree generated by depth-first search can have arbitrary 
in-degree because the number of vertices that can be reached from a vertex in the original 
graph is not restricted. The procedure tree_eval described in the proof of Theorem 8.12.1 
can be modified to apply to the evaluation of this DAG whose vertex in-degree is potentially 
unbounded. (See Problem 8.30.) This modified procedure runs in space polynomial in the 
size of the graph G. 
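
The evaluation of the search tree can be sketched as a recursive game search. The code below is an illustration only, assuming the graph is given as a dictionary from each vertex to a list of its successors. It does not build the (possibly exponential-size) tree explicitly. Instead it explores one path at a time, as in the depth-first variant described above.

```python
def first_player_wins(graph, v):
    """Decide GENERALIZED GEOGRAPHY when the first player opens with vertex v.

    graph maps each vertex to a list of its successors (assumed encoding).
    """
    def mover_wins(current, visited):
        # The player to move wins if some unvisited successor leaves the
        # opponent in a losing position; with no legal move the result is False.
        return any(
            not mover_wins(u, visited | {u})
            for u in graph.get(current, ())
            if u not in visited
        )

    # After the first player selects v, it is the second player's turn.
    return not mover_wins(v, frozenset([v]))
```

The recursion depth is at most $|V|$ and each active call stores a set of at most $|V|$ vertices, so the space used is polynomial in the size of $G$ although the running time may be exponential.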

We now show that ALTERNATING QUANTIFIED SATISFIABILITY (abbreviated AQSAT) 
can be reduced in log-space to GENERALIZED GEOGRAPHY. Given an instance of AQSAT 
such as that shown below, we construct an instance of GENERALIZED GEOGRAPHY, as 
shown in Fig. 8.19. We assume without loss of generality that the number of quantifiers is 
even. If not, add a dummy variable and quantify on it: 

3x\ix2^x^ix^[{x\ V X2 V x$) A (xi V12V £3) A (JE\ V x 2 V 5%) A (#4 V X4)] 

Figure 8.19 An instance of GENERALIZED GEOGRAPHY corresponding to an instance of
ALTERNATING QUANTIFIED SATISFIABILITY.

The instance of GENERALIZED GEOGRAPHY corresponding to an instance of AQSAT
is formed by cascading a set of diamond-shaped subgraphs, one per variable (see Fig. 8.19),
and connecting the bottom vertex $b$ in the last diamond to a set of vertices, one per clause.
An edge is drawn from a clause vertex to the vertex associated with a literal ($x_i$ or $\bar{x}_i$) if that
literal is in the clause. The two literals of a variable are associated with the two middle vertices
of its diamond, one on the left-hand side and one on the right-hand side, as labeled in Fig. 8.19.
Thus, in the example, there is an edge from the leftmost
clause vertex to the left-hand vertex in the diamond for $x_3$ and to the right-hand vertices in
diamonds for $x_1$ and $x_2$.

Let the geography game be played on this graph starting with the first player from the 
topmost vertex labeled t. The first player can choose either the left or right path. The second 
player has only one choice, taking it to the bottom of the first diamond, and the first player 
now has only one choice, taking it to the top of the second diamond. The second player 
now can choose a path to follow. Continuing in this fashion, we see that the first (second) 
player can exercise a choice on the odd- (even-) numbered diamonds counting from the top. 
Since the number of quantifiers is even, the choice at the bottom vertex labeled b belongs to 
the second player. Observe that whatever choices are made within the diamonds, the vertices 
labeled m and b are visited. 

Because the goal of each player is to force the other player into a position from which
it has no moves, at vertex $b$ the second player attempts to choose a clause vertex such that
the first player has no moves: that is, every vertex reachable from the clause vertex chosen by
the second player has already been visited. On the other hand, if all clauses are satisfied,
then for every clause chosen by the second player there is an edge from its vertex to
a diamond vertex that has not been previously visited. To ensure that the first player wins if
and only if the instance of AQSAT used to construct this graph is a "Yes" instance, the first
player always chooses an edge according to the directions in Fig. 8.19. For example, it visits
the vertex labeled $\bar{x}_1$ if it wishes to set $x_1 = 1$ because this means that the vertex labeled $x_1$
is not visited on the path from $t$ to $b$ and can be visited by the first player on the last step of
the game. Since each vertex labeled $m$ and $b$ is visited before a clause vertex is visited, the
second player does not have a move and loses. ■



8.13 The Circuit Model of Computation 



The complexity classes seen so far in this chapter are defined in terms of the space and 
time needed to recognize languages with deterministic and nondeterministic Turing machines. 
These classes generally help us to understand the complexity of serial computation. Circuit 
complexity classes, studied in this section, help us to understand parallel computation. 

Since a circuit is a fixed interconnection of gates, each circuit computes a single Boolean 
function on a fixed number of inputs. Thus, to compute the unbounded set of functions 
computed by a Turing machine, a family of circuits is needed. In this section we investigate 
uniform and non-uniform circuit families. A uniform family of circuits is a potentially un- 
bounded set of circuits for which there is a Turing machine that, given an integer n in unary 
notation, writes a description of the nth circuit. We show that uniform circuits compute the 
same functions as Turing machines. 

As mentioned below, non-uniform families of circuits are so powerful that they can com- 
pute functions not computed by Turing machines. Given the Church-Turing thesis, it doesn't 
make sense to assume non-uniform circuits as a model of computation. On the other hand, if 
we can develop large lower bounds on the size or depth of circuits without regard to whether or
not they are drawn from a uniform family, then such lower bounds apply to uniform families 
as well and, in particular, to other models of computation, such as Turing machines. For this 
reason non-uniform circuits are important. 

A circuit is a form of unstructured parallel machine, since its gates can operate in parallel. 
The parallel random-access machine (PRAM) introduced in Chapter 1 and examined in Chap- 
ter 7 is another important parallel model of computation in terms of which the performance 
of many other parallel computational models can be measured. In Section 8.14 we show that 
circuit size and depth are related to number of processors and time on the PRAM. These results 
emphasize the important role of circuits not only in the construction of machines, but also in 
measuring the serial and parallel complexity of computational problems. 

Throughout the following sections we assume that circuits are constructed from gates
chosen from the standard basis $\Omega_0$ = {AND, OR, NOT}.

We now explore uniform and non-uniform circuit families, thereby setting the stage for 
the next chapter, in which methods for deriving lower bounds on the size of circuits are devel- 
oped. After introducing uniform circuits we show that uniform families of circuits and Turing 
machines compute the same functions. We then introduce a number of languages defined in 
terms of the properties of families of circuits that recognize them. 

8.13.1 Uniform Families of Circuits 

Families of circuits are useful in characterizing decision problems in which the set of instances 
is unbounded. One circuit in each family is associated with the "Yes" instances of each length: 
it has value 1 on the "Yes" instances and value 0 otherwise.

Families of circuits are designed in Chapter 3 to simulate computations by finite-state, 
random-access, and Turing machines on arbitrary numbers of inputs. For each machine M 
of one of these types, there is a DTM S(M) such that on an input of length n, S(M) can 
produce as output the description of a circuit on n inputs that computes exactly the same 
function as does M on n inputs. (See the program in Fig. 3.27.) These circuits are generated 
in a uniform fashion. 

On the other hand, non-uniform circuit families can be used to define non-computable
languages. For example, consider the family in which the $n$th circuit, $C_n$, is designed to have
value 1 on those strings $w$ of length $n$ in the language $\mathcal{L}_1$ defined in Section 5.7 and value 0
otherwise. Such a circuit realizes the minterm defined by $w$. As shown in Theorem 5.7.4, $\mathcal{L}_1$
is not recursively enumerable; that is, there is no Turing machine that can recognize it.

This example motivates the need to identify families of circuits that compute functions 
computable by Turing machines, that is, uniform families of circuits. 

DEFINITION 8.13.1 A circuit family $\mathcal{C} = \{C_1, C_2, C_3, \ldots\}$ is a collection of logic circuits in
which $C_n$ has $n$ inputs and $m(n)$ outputs for some function $m : \mathbb{N} \mapsto \mathbb{N}$.

A time-$r(n)$ (space-$r(n)$) uniform circuit family is a circuit family for which there is a
deterministic Turing machine $M$ such that for each integer $n$ supplied in unary notation, namely
$1^n$, on its input tape, $M$ writes the description of $C_n$ on its output tape using time (space) $r(n)$.

A log-space uniform circuit family is one for which the temporary storage space used by a
Turing machine that generates it is $O(\log n)$, where $n$ is the length of the input. The function
$f : \mathcal{B}^* \mapsto \mathcal{B}^*$ is computed by $\mathcal{C}$ if for each $n \geq 1$, $f$ restricted to $n$ inputs is the function
computed by $C_n$.

8.13.2 Uniform Circuits Are Equivalent to Turing Machines

We now show that the functions computed by log-space uniform families of circuits and by 
polynomial-time DTMs are the same. Since the family of functions computed by one-tape 
and multi-tape Turing machines are the same (see Theorem 5.2.1), we prove the result only 
for the standard one-tape Turing machine and proper resource functions (see Section 8.3). 

THEOREM 8.13.1 Let $p(n)$ be a polynomial and a proper function. Then every total function
$f : \mathcal{B}^* \mapsto \mathcal{B}^*$ computed by a DTM in time $p(n)$ on inputs of length $n$ can be computed by a
log-space uniform circuit family $\mathcal{C}$.

Proof Let $f_n : \mathcal{B}^n \mapsto \mathcal{B}^*$ be the restriction to inputs of length $n$ of the function
$f : \mathcal{B}^* \mapsto \mathcal{B}^*$ computed by a DTM $M$ in time $p(n)$. It follows that the number of bits in the word
$f_n(w)$ is at most $p(n)$. Since the function computed by a circuit has a fixed-length output
and the length of $f_n(w)$ may vary for different inputs $w$ of length $n$, we show how to create
a DTM $M^*$, a modified version of $M$, that computes $f_n^*$, a function that contains all the
information in the function $f_n$. The value of $f_n^*$ has at most $2p(n)$ bits on inputs of length
$n$. We show that $M^*$ produces its output in time $O(p^2(n))$.

Let $M^*$ place a mark in the $2p(n)$th cell on its tape (a cell beyond any reached during
a computation). Let it now simulate $M$, which is assumed to print its output in the first
$k$ locations on the tape, $k \leq p(n)$. $M^*$ now recodes and expands this binary string into a
longer string. It does so by marking $k$ cells to the right of the output string (in at most $k^2$ steps),
after which it writes every letter in the output string twice. That is, 0 appears as 00 and 1
as 11. Finally, the remaining $2(p(n) - k)$ cells are filled with alternating 0s and 1s. Clearly,
the value of $f_n$ can be readily deduced from the output, but the length of the value of $f_n^*$ is the
same on all inputs of length $n$.
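
The recoding performed by $M^*$ can be illustrated on strings. The sketch below is not a Turing machine. It only shows why the fixed-length output determines the original one. It assumes the alternating padding starts with 0, so that each padding pair reads 01 and cannot be confused with a doubled letter 00 or 11.

```python
def encode(output, p):
    """Expand a binary string of length k <= p into a string of length exactly 2p.

    Each output bit is written twice; the remaining 2(p - k) cells are filled
    with alternating 0s and 1s (assumed here to start with 0).
    """
    k = len(output)
    if k > p:
        raise ValueError("output longer than p")
    doubled = "".join(ch * 2 for ch in output)
    return doubled + "01" * (p - k)


def decode(word):
    """Recover the original string: read pairs until the first unequal pair."""
    bits = []
    for i in range(0, len(word), 2):
        pair = word[i:i + 2]
        if pair[0] != pair[1]:      # start of the alternating padding
            break
        bits.append(pair[0])
    return "".join(bits)
```

For instance, with $p = 5$ the output 101 is recoded as 1100110101, from which the first three pairs give back 101.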

A Turing machine $M_C$ that constructs the $n$th circuit from $n$ represented in unary and a
description of $M^*$ invokes a slightly revised version of the program of Fig. 3.27 to construct
the circuit computing $f_n$. This revised circuit contains placeholders for the values of the
$n$ letters representing the input to $M$. The program uses space $O(\log p(n))$, which is
logarithmic in $n$. ■

We now show that the function computed by a log-space uniform family of circuits can be 
computed by a polynomial-time Turing machine. 

THEOREM 8.13.2 Let C be a log-space uniform circuit family. Then there exists a polynomial-time 
Turing machine M that computes the same set of functions computed by the circuits in C. 

Proof Let $M_C$ be the log-space TM that computes the circuit family $\mathcal{C}$. We design the TM
$M$ to compute the same set of functions on an input $w$ of length $n$. $M$ uses $w$ to obtain a
unary representation of $n$ as the input to $M_C$. It uses $M_C$ to write down a description of the $n$th
circuit on its work tape. It then computes the outputs of this circuit in time quadratic in the
length of the circuit. Since the length of the circuit is a polynomial in $n$ because the circuit
is generated by a log-space TM (see Theorem 8.5.8), the running time of $M$ is polynomial
in the length of $w$. ■
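
The two steps of this proof, generating a description of the $n$th circuit and then evaluating it on $w$, can be sketched as follows. The family chosen (parity of $n$ inputs over the basis AND, OR, NOT) and the gate-list description format are assumptions made here for illustration. The generator keeps only a few counters, in the spirit of a log-space machine, although Python itself enforces no such bound.

```python
def describe_circuit(n):
    """Return (gates, output_wire) for a circuit C_n computing the parity of n inputs.

    Wires 0..n-1 are the inputs. Each gate is (wire_id, op, *input_wires),
    listed in an order in which inputs precede the gates that use them.
    """
    gates = []
    acc, nxt = 0, n                      # acc: wire holding the parity so far
    for i in range(1, n):
        # XOR(acc, i) = (acc OR i) AND NOT (acc AND i)
        gates.append((nxt, "OR", acc, i))
        gates.append((nxt + 1, "AND", acc, i))
        gates.append((nxt + 2, "NOT", nxt + 1))
        gates.append((nxt + 3, "AND", nxt, nxt + 2))
        acc, nxt = nxt + 3, nxt + 4
    return gates, acc


def evaluate_circuit(gates, output, w):
    """Evaluate a gate list on the input bits w and return the output wire's value."""
    val = dict(enumerate(w))
    for wire, op, *args in gates:
        if op == "NOT":
            val[wire] = 1 - val[args[0]]
        elif op == "AND":
            val[wire] = val[args[0]] & val[args[1]]
        else:                            # "OR"
            val[wire] = val[args[0]] | val[args[1]]
    return val[output]


def run(w):
    """Mimic M: derive n from w, generate C_n, then evaluate it on w."""
    gates, out = describe_circuit(len(w))
    return evaluate_circuit(gates, out, w)
```

Because the description has length polynomial in $n$, a straightforward evaluation such as this one takes time polynomial in the length of $w$.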

These two results can be generalized to uniform circuit families and Turing machines that 
use more than logarithmic space and polynomial time, respectively. (See Problem 8.32.) 

In the above discussion we examine functions computed by Turing machines. If these
functions are characteristic functions, $f : \mathcal{B}^* \mapsto \mathcal{B}$; that is, they have value 0 or 1, then
those strings for which $f$ has value 1 define a language $L_f$. Also, associated with each language
$L \subseteq \mathcal{B}^*$ is a characteristic function $f_L : \mathcal{B}^* \mapsto \mathcal{B}$ that has value 1 on only those strings in $L$.

Consider now a language $L \subseteq \mathcal{B}^*$. For each $n \geq 1$ a circuit can be constructed whose
value is 1 on binary strings in $L \cap \mathcal{B}^n$ and 0 otherwise. Similarly, given a family $\mathcal{C}$ of circuits
such that for each natural number $n \geq 1$ the $n$th circuit, $C_n$, computes a Boolean function
on $n$ inputs, the language $L$ associated with this circuit family contains only those strings of
length $n$ for which $C_n$ has value 1. We say that $L$ is recognized by $\mathcal{C}$. At the risk of confusion,
we use the same name for a circuit family and the language it defines.

In Theorem 8.5.6 we show that NSPACE$(r(n)) \subseteq$ TIME$(k^{\log n + r(n)})$. We now use
the ideas of that proof together with the parallel algorithm for transitive closure given in
Section 6.4 to show that languages in NSPACE$(r(n))$, $r(n) \geq \log n$, are recognized by a uniform
family of circuits in which the $n$th circuit has size $O(k^{\log n + r(n)})$ and depth $O(r^2(n))$. When
$r(n) = O(\log n)$, the circuit family in question is contained in the class $NC^2$ introduced in
the next section.

THEOREM 8.13.3 If language $L \subseteq \mathcal{B}^*$ is in NSPACE$(r(n))$, $r(n) \geq \log n$, there exists a
time-$r(n)$ uniform family of circuits recognizing $L$ such that the $n$th circuit has size $O(k^{\log n + r(n)})$
and depth $O(r^2(n))$ for some constant $k$.

Proof We assume without loss of generality that the NDTM $M$ accepting $L$ has one accepting
configuration. We then construct the adjacency matrix for the configuration graph of $M$.
This matrix has a 1 entry in row $i$, column $j$ if there is a transition from the $i$th to the
$j$th configuration. All other entries are 0. From the analysis of Corollary 8.5.1, this graph
has $O(k^{\log n + r(n)})$ configurations. The initial configuration is determined by the word $w$
written initially on the tape of the NDTM accepting $L$. If the transitive closure of this
matrix has a 1 in the row and column corresponding to the initial and final configurations,
respectively, then the word $w$ is accepted.

From Theorem 6.4.1 the transitive closure of a Boolean $p \times p$ matrix $A$ can be computed
by computing $(I + A)^q$ for $q \geq p - 1$. This can be done by squaring $I + A$ $s$ times for
$s \geq \log_2 p$. Since each squaring can be done by a circuit of depth $O(\log m)$ when $p = m$,
we conclude that the transitive closure can be computed by a circuit
of depth $O(\log^2 m)$, where $m$ is the number of configurations. Since $m = O(k^{\log n + r(n)})$,
we have the desired circuit size and depth bounds.
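
The repeated-squaring computation can be sketched directly. This is a sequential illustration of the matrix identity only, assuming the matrix is given as a list of lists of 0/1 entries. In the circuit each Boolean matrix product is realized by AND gates followed by OR trees of logarithmic depth.

```python
def bool_square(m):
    """Boolean product of a 0/1 matrix with itself (OR of ANDs)."""
    p = len(m)
    return [
        [1 if any(m[i][k] and m[k][j] for k in range(p)) else 0 for j in range(p)]
        for i in range(p)
    ]


def transitive_closure(a):
    """Compute (I + A)^q for some q >= p - 1 by squaring I + A repeatedly."""
    p = len(a)
    m = [[1 if (a[i][j] or i == j) else 0 for j in range(p)] for i in range(p)]
    s = 0
    while (1 << s) < p:        # stop once 2^s >= p, that is, s >= log2(p)
        m = bool_square(m)
        s += 1
    return m
```

Entry $(i, j)$ of the result is 1 exactly when vertex $j$ is reachable from vertex $i$ by a path of length zero or more.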

A program to compute the $2^d$th power of a $p \times p$ matrix $A$ is shown in Fig. 8.20. This
program can be converted to one that writes the description of a circuit for this purpose,
and both the original and converted programs can be realized in space $O(d \log p)$. (See
Problem 8.33.) Invoking this procedure to write a program for the above problem, we see
that an $O(r^2(n))$-depth circuit recognizing $L$ can be written by an $O(r^2(n))$-time DTM. ■

trans(A, n, d, i, j)
  if d = 0 then
    return(a_{i,j})
  else
    return(Σ_{k=1}^{n} trans(A, n, d-1, i, k) * trans(A, n, d-1, k, j))

Figure 8.20 A recursive program to compute the $2^d$th power of an $n \times n$ matrix $A$.
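
A direct Python rendering of the recursion in Fig. 8.20 is given below as a sketch. It assumes Boolean entries, with the sum read as OR and the product as AND, and zero-based indices. Note that with $d$ levels of recursion the value returned is the $(i, j)$ entry of $A^{2^d}$, since each level squares the matrix of the level below.

```python
def trans(a, n, d, i, j):
    """Entry (i, j) of the Boolean matrix A^(2^d), computed recursively.

    a is an n x n matrix of truth values; indices are zero-based here.
    """
    if d == 0:
        return bool(a[i][j])
    # OR over k of trans(d-1, i, k) AND trans(d-1, k, j)
    return any(
        trans(a, n, d - 1, i, k) and trans(a, n, d - 1, k, j)
        for k in range(n)
    )
```

Only $d$ activation records are live at any time and each holds a constant number of indices of $O(\log n)$ bits, which is the source of the $O(d \log p)$ space bound quoted above. The running time, by contrast, grows as $(2n)^d$.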



8.14 The Parallel Random-Access Machine Model 

The PRAM model, introduced in Section 7.9, is an abstraction of realistic parallel models that 
is sufficiently rich to permit the study of parallel complexity classes. (See Fig. 7.21, repeated as 
Fig. 8.21.) The PRAM consists of a set of RAM processors with a bounded number of memory 
locations and a common memory. The words of the common memory are allowed to be of 
unlimited size, but the instructions that the RAM processors can apply to them are restricted. 
These processors can perform addition, subtraction, vector comparison operations, conditional 
branching, and shifts by fixed amounts. We also allow load and store instructions for moving 
words between registers, local memories, and the common memory. These instructions are 
sufficiently rich to compute all computable functions. 

In the next section we show that the CREW (concurrent read/exclusive write) PRAM that 
runs in polynomial time and the log-space uniform circuits characterize the same complexity 
classes. We then go on to explore the parallel complexity thesis, which states that sequential 
space and parallel time are polynomially related. 

8.14.1 Equivalence of the CREW PRAM and Circuits 

Because a parallel machine with p processors can provide a speedup of at most a factor of p over 
a comparable serial machine (see Theorem 7.4.1), problems that are computationally infeasi- 
ble on serial machines are computationally infeasible on parallel machines with a reasonable 
number of processors. For this reason the study of parallelism is usually limited to feasible 
problems, that is, problems that can be solved in serial polynomial time (the class P). We limit 
our attention to such problems here. 

Figure 8.21 The PRAM consists of synchronous RAMs accessing a common memory.

Connections between PRAMs and circuits can be derived that are similar to those stated
for Turing machines and circuits in Section 8.13.2. In this section we consider only log-space 
uniform families of circuits. 

Given a PRAM, we now construct a circuit simulating it. This construction is based
on that given in Section 3.4. With a suitable definition of log-space uniform family of
PRAMs the circuits described in the following lemma constitute a log-space uniform family
of circuits. (See Problem 8.35.) Also, this lemma can be extended to PRAMs that access
memory locations with addresses much larger than $O(p(n)t(n))$, perhaps through indirect
addressing. (See Problem 8.37.)

LEMMA 8.14.1 Consider a function on input words of total length $n$ bits computed by a CREW
PRAM $P$ in time $t(n)$ with a polynomial number of processors $p(n)$ in which the largest common
memory address is $O(p(n)t(n))$. This function can be computed by a circuit of size
$O(p^2(n)t(n) + p(n)t^2(n))$ and depth $O(\log(p(n)t(n)))$.

Proof Since $P$ executes at most $t(n)$ steps, by a simple extension to Problem 8.4 (only one
RAM CPU at a time writes a word), we know that after $t(n)$ steps each word in the common
memory of the PRAM has length at most $b = t(n) + n + K$ for some constant $K > 0$,
because the PRAM can only compare or add numbers or shift them left by one position on
each time step. This follows because each RAM CPU uses integers of fixed length and the
length of the longest word in the common memory is initially $n$.

We exhibit a circuit for the computation by $P$ by modifying and extending the circuit
sketched in Section 3.4 to simulate one RAM CPU. This circuit uses the next-state/output
circuit for the RAM CPU together with the next-state/output circuit for the random-access
memory of Fig. 3.21 (repeated in Fig. 8.22). The circuit of Fig. 8.22(a) either writes a new
value $d_j$ for $w^*_{i,j}$, the $j$th component of the $i$th memory word of the random-access memory,
or it writes the old value $w_{i,j}$. The circuit simulating the common memory of the PRAM
is obtained by replacing the three gates at the output of the circuit in Fig. 8.22(a) with a
subcircuit that assigns to $w^*_{i,j}$ the value of $w_{i,j}$ if $c_j = 0$ for each RAM CPU and the OR of
the values of $d_j$ supplied by each RAM CPU if $c_j = 1$ for some CPU. Here we count on the
fact that at most one CPU addresses a given location for writing. Thus, if a CPU writes to
a location, no other CPU can do so. Concurrent reading is simulated by allowing every
component of every memory cell to be used as input by every CPU.

Since the longest word that can be constructed by the CREW PRAM has length $b =
t(n) + n + K$, it follows from Lemma 3.5.1 that the next-state/output circuit for the
random-access memory designed for one CPU has size $O(p(n)t^2(n))$ and depth $O(\log(p(n)t(n)))$.
The modifications described in the previous paragraph add size $O(p^2(n)t(n))$ (each of the
$p(n)t(n)$ memory words has $O(p(n))$ new gates) and depth $O(\log p(n))$ (each OR tree
has p(n) inputs) to this circuit. As shown at the end of Section 3.10, the size and depth
of a circuit for the next-state/output circuit of the CPU are $O(t(n) + \log(p(n)t(n)))$ and
$O(\log t(n) + \log\log(p(n)t(n)))$, respectively. Since these sizes and depths add to those
for the common memory, the total size and depth for the next-state/output circuit for the
PRAM are $O(p^2(n)t(n) + p(n)t^2(n))$ and $O(\log(p(n)t(n)))$, respectively. ■
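
The bookkeeping in this proof can be collected in one place. The tally below is a sketch that assumes the p(n) CPU circuits are counted separately, so that the per-CPU size is multiplied by p(n); that multiplication is a reading of the proof rather than a statement taken from it. The `align*` environment assumes the amsmath package.

```latex
\begin{align*}
\text{size}  &= \underbrace{O\bigl(p(n)\,t^2(n)\bigr)}_{\text{memory, one CPU}}
              + \underbrace{O\bigl(p^2(n)\,t(n)\bigr)}_{\text{OR trees}}
              + \underbrace{p(n)\cdot O\bigl(t(n) + \log(p(n)t(n))\bigr)}_{\text{CPUs}}
              = O\bigl(p^2(n)\,t(n) + p(n)\,t^2(n)\bigr), \\
\text{depth} &= O\bigl(\log(p(n)t(n))\bigr) + O\bigl(\log p(n)\bigr)
              + O\bigl(\log t(n) + \log\log(p(n)t(n))\bigr)
              = O\bigl(\log(p(n)t(n))\bigr).
\end{align*}
```

The CPU term is absorbed because $p\,t \le p\,t^2$ and $p \log(pt) \le p^2 t + p\,t^2$ for $p, t \ge 1$.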

We now show that the function computed by a log-space uniform circuit family can be 
computed in poly-logarithmic time on a PRAM. 

Figure 8.22 A circuit for the next-state and output function of the random-access memory.
The circuit in (a) computes the next values for components of memory words, whereas that in (b) 
computes components of the output word. This circuit is modified to generate a circuit for the 
PRAM. 



LEMMA 8.14.2 Let $\mathcal{C} = \{C_1, C_2, \ldots\}$ be a log-space uniform family of circuits. There exists a
CREW PRAM that computes in poly-logarithmic time and with a polynomial number of processors the
function $f : \mathcal{B}^* \mapsto \mathcal{B}^*$ computed by $\mathcal{C}$.

Proof The CREW PRAM is given a string w on which to compute the function f. First
it computes the length n of w. Second it invokes the CREW PRAM described below to
simulate with a polynomial number of processors in poly-logarithmic time the log-space
DTM M that writes a description of the nth circuit, C(M, n). Finally we show that the
value of C(M, n) can be evaluated from this description by a CREW PRAM in $O(\log^2 n)$
steps with polynomially many processors.

Let M be a three-tape DTM that realizes a log-space transformation. This DTM has 
a read-only input tape, a work tape, and a write-only output tape. Given a string w on its 
input tape, it provides on its output tape the result of the transformation. Since M uses 
$O(\log n)$ cells on its work tape on inputs of length n, it can be modeled by a finite-state
machine with $2^{O(\log n)}$ states. The circuit C(M, n) described in Theorem 3.2.2 for the
simulation of the FSM M is constructed to simulate M on inputs of length n. We show
that C(M, n) has size and depth that are polynomial and poly-logarithmic in n, respectively.
We then demonstrate that a CREW PRAM can simulate C(M, n) (and write its output into
its common memory) in $O(\log^2 n)$ steps with a polynomial number of processors.

From Theorem 8.5.8 we know that the log-space DTM M generating C(M, n) does
not execute more than p(n) steps, p(n) a polynomial in n. Since p(n) is assumed proper,
we can assume without loss of generality that M executes p(n) steps on all inputs of length
n. Thus, M has exactly $|Q| = O(p(n))$ configurations.

The input string w is placed in the first n locations of the otherwise blank common 
memory. To determine the length of the input, for each i the ith CREW PRAM processor 
examines the words in locations i and i + 1. If location i + 1 is blank but location i is not, 
i = n. The nth processor then computes p(n) in $O(\log n)$ serial steps (see Problem 8.2)
and places it in common memory. 
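
The length test is simple enough to simulate directly. The sketch below is an illustration under assumptions made here: the common memory is a Python list whose index 0 is unused so that locations are 1-indexed, `BLANK` is a hypothetical blank marker, the input is nonempty, and the loop runs sequentially over the processors that the PRAM would run in parallel.

```python
BLANK = None  # hypothetical marker for a blank common-memory location


def find_length(memory):
    """Return n such that locations 1..n hold input symbols and n+1 is blank.

    memory[0] is unused so that locations are 1-indexed as in the text.
    """
    winners = []
    for i in range(1, len(memory) - 1):   # iteration i plays processor i
        # Processor i reads locations i and i + 1 (reads may be concurrent).
        if memory[i] is not BLANK and memory[i + 1] is BLANK:
            winners.append(i)             # only processor n passes the test
    assert len(winners) == 1, "expected a nonempty input followed by blanks"
    return winners[0]
```

Only one processor passes the test, so the single write of n to common memory respects the exclusive-write rule.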

The circuit C(M, n) is constructed from representations of next-state mappings, one 
mapping for every state transition. Since there are no external inputs to M (all inputs are 
recorded on the input tape before the computation begins), all next-state mappings are the 
same. As shown in Section 3.2, let this one mapping be defined by a Boolean $|Q| \times |Q|$
matrix $M_a$ whose rows and columns are indexed by configurations of M. A configuration
of M is a tuple $(q, h_1, h_2, h_3, x)$ in which q is the current state, $h_1$, $h_2$, and $h_3$ are the
positions of the heads on the input, output, and work tapes, respectively, and x is the current
contents of the work tape. Since M computes a log-space transformation, it executes a
polynomial number of steps. Thus, each configuration has length $O(\log n)$. Consequently,
a single CREW PRAM processor can determine in $O(\log n)$ time whether an entry in row r and
column c, where r and c are associated with configurations, has value 0 or 1. For concreteness,
assign PRAM processor i to row r and column c of $M_a$, where $r = \lfloor i/p(n) \rfloor$ and
$c = i - r \times p(n)$, quantities that can be computed in $O(\log n)$ steps.

The circuit C(M, n) simulating M is obtained via a prefix computation on p(n) copies 
of the matrix Ma using matrix multiplication as the associative operator. (See Section 3.2.) 
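
To make the role of matrix multiplication concrete, the following Python sketch builds the Boolean next-configuration matrix of a toy machine and computes the prefix products. It is an illustration only: the helper names, the encoding of configurations as integers, and the sequential left-to-right evaluation are assumptions made here. A parallel prefix circuit would use the associativity of the product to reach the same matrices in logarithmic depth.

```python
from typing import List

Matrix = List[List[int]]


def successor_matrix(next_conf: List[int]) -> Matrix:
    """Entry (r, c) is 1 exactly when configuration r steps to configuration c."""
    n = len(next_conf)
    return [[int(next_conf[r] == c) for c in range(n)] for r in range(n)]


def bool_matmul(a: Matrix, b: Matrix) -> Matrix:
    """Boolean matrix product: OR of ANDs in place of sum of products."""
    n = len(a)
    return [[int(any(a[i][k] and b[k][j] for k in range(n)))
             for j in range(n)] for i in range(n)]


def prefix_products(m: Matrix, steps: int) -> List[Matrix]:
    """Return [m, m^2, ..., m^steps]; the t-th entry describes t steps."""
    out = [m]
    for _ in range(steps - 1):
        out.append(bool_matmul(out[-1], m))
    return out


# Toy machine: 0 -> 1 -> 2 -> 3, and configuration 3 is a halting fixed point.
powers = prefix_products(successor_matrix([1, 2, 3, 3]), 3)
print(powers[2][0])  # row 0 of the third power marks the configuration after 3 steps
```

Row r of the t-th prefix product has its single 1 in the column of the configuration reached from r after t steps.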

Once C(M, n) has been written into the common memory, it can be evaluated by 
assigning one processor per gate and then computing its value as many times as the depth of 
C(M, n). This involves a four-phase operation in which the jth processor reads each of the 
at most two arguments of the jth gate in the first two phases, computes its value in the third, 
and then writes it to common memory in the fourth. This process is repeated as many times 
as the depth of the circuit C(M, n), thereby ensuring that correct values for gates propagate
throughout the circuit. Again concurrent reads and exclusive writes suffice. ■ 
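
The repeated four-phase evaluation can be simulated as follows. This is a minimal sketch under assumptions made here: gates are (operation, arguments) tuples in a Python list, input gates are loaded into memory before the first round, and the inner loop runs sequentially over the gates that the PRAM would process in parallel with one processor each.

```python
def evaluate_by_rounds(gates, inputs, depth):
    """Evaluate a circuit by recomputing every gate `depth` times.

    gates[j] = (op, args) with op in {"IN", "NOT", "AND", "OR"}; for "IN",
    args[0] indexes `inputs`, otherwise args index other gates.
    """
    # Input values are already in memory; all other cells start at 0.
    values = [inputs[args[0]] if op == "IN" else 0 for op, args in gates]
    for _ in range(depth):
        old = list(values)                 # snapshot read by all processors
        for j, (op, args) in enumerate(gates):   # j plays processor j
            if op == "IN":
                continue
            x = old[args[0]]                        # phase 1: first argument
            y = old[args[1]] if len(args) > 1 else 0  # phase 2: second argument
            if op == "NOT":                         # phase 3: compute
                result = 1 - x
            elif op == "AND":
                result = x & y
            else:
                result = x | y
            values[j] = result                      # phase 4: exclusive write
    return values


# XOR of two inputs, depth 3: (x AND NOT y) OR (NOT x AND y).
XOR_GATES = [("IN", (0,)), ("IN", (1,)), ("NOT", (0,)), ("NOT", (1,)),
             ("AND", (0, 3)), ("AND", (2, 1)), ("OR", (4, 5))]
```

After round t every gate at depth at most t holds its correct value, which is why repeating the round as many times as the circuit depth suffices.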

These two results (and Problem 8.37) imply the result stated below, namely, that the bi- 
nary functions computed by circuits with polynomial size and poly-logarithmic depth are the 
same as those computed by the CREW PRAM with polynomially many processors and poly- 
logarithmic time. 

THEOREM 8.14.1 The functions $f : \mathcal{B}^* \mapsto \mathcal{B}^*$ computed by circuits of polynomial size and
poly-logarithmic depth are the same as those computed by the CREW PRAM with a polynomial number
of processors and poly-logarithmic time.

8.14.2 The Parallel Computation Thesis 

A deep connection exists between serial space and parallel time. The parallel computation 
thesis states that sequential space and parallel time are polynomially related; that is, if there 
exists a sequential algorithm that uses space S, then there exists a parallel algorithm using time 
p(S) for some polynomial p and vice versa. There is strong evidence that this hypothesis holds. 

In this section we set the stage for discussing the parallel computation thesis in a limited
way by showing that every log-space reduction (on a Turing machine) can be realized by a
CREW PRAM in time $O(\log^2 n)$ with polynomially many processors. This implies that if a
P-complete problem can be solved on a PRAM with polynomially many processors in
poly-logarithmic time, then so can every problem in P, an unlikely prospect.

LEMMA 8.14.3 Log-space transformations can be realized by CREW PRAMs with polynomially
many processors in time $O(\log^2 n)$.

Proof We use the CREW PRAM described in the proof of Lemma 8.14.2. The processors
in this PRAM are then assigned to perform the matrix operations in the order required
for a parallel prefix computation. (See Section 2.6.) If we assign $|Q(n)|^2$ processors per
matrix multiplication operation, each operation can be done in $O(\log |Q(n)|^2) = O(\log n)$
steps. Since the prefix computation has depth $O(\log n)$, the PRAM can perform the prefix
computation in time $O(\log^2 n)$. The number of processors used is $p(n) \cdot O(|Q(n)|^2)$, which
is a polynomial in n. Concurrent reads and exclusive writes suffice for these operations. ■
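
The logarithmic depth of a prefix computation can be seen in a small simulation. The sketch below uses a doubling scan (often attributed to Hillis and Steele), which is a stand-in chosen here for brevity and not necessarily the prefix circuit of Section 2.6; the function name and the round counter are assumptions of this illustration. Any associative operator, including Boolean matrix multiplication, can be passed as `op`.

```python
import operator


def parallel_prefix(items, op):
    """Inclusive scan by doubling; returns (prefixes, number of rounds).

    In each round every position i >= step combines the value `step`
    places to its left with its own value. All positions read the
    previous round's values, so the updates of one round are independent
    and could be carried out by separate processors.
    """
    vals = list(items)
    rounds = 0
    step = 1
    while step < len(vals):
        prev = list(vals)                  # values from the previous round
        for i in range(step, len(vals)):   # conceptually simultaneous
            vals[i] = op(prev[i - step], prev[i])
        step *= 2
        rounds += 1
    return vals, rounds


print(parallel_prefix([1, 2, 3, 4, 5], operator.add))
```

The operator is always applied with the left block first, so the scan remains correct for associative operators that are not commutative.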

Since a log-space transformation can be realized in poly-logarithmic time with polynomi- 
ally many processors on a CREW PRAM, if a CREW PRAM solves a P-complete problem in 
poly-logarithmic time, we can compose such machines to form a CREW PRAM with poly- 
logarithmic time and polynomially many processors to solve an arbitrary problem in P. 

THEOREM 8.14.2 If a P-complete problem can be solved in poly-logarithmic time with
polynomially many processors on a CREW PRAM, then so can all problems in P, and all problems in P
are fully parallelizable.



8.15 Circuit Complexity Classes 



In this section we introduce several important circuit complexity classes including NC, the 
languages recognized by uniform families of circuits whose size and depth are polynomial and 
poly-logarithmic in n, respectively, and P/poly, the largest set of languages $L \subseteq \mathcal{B}^*$ with the
property that L is recognized by a (non-uniform) circuit family of polynomial size. We also 
derive relationships among these classes and previously defined classes. 

8.15.1 Efficiently Parallelizable Languages 

DEFINITION 8.15.1 The class $\mathrm{NC}^k$ contains those languages L recognized by a uniform family of
Boolean circuits of polynomial size and depth $O(\log^k n)$ in n, the length of an input. The class
NC is the union of the classes $\mathrm{NC}^k$, $k \geq 1$; that is,

$$\mathrm{NC} = \bigcup_{k \geq 1} \mathrm{NC}^k$$

In Section 8.14 we explored the connection between circuit size and depth and PRAM 
time and number of processors and concluded that circuits having polynomial size and poly- 
logarithmic depth compute the same languages as do PRAMs with a polynomial number of 
processors and poly-logarithmic parallel time. 

The class NC is considered to be the largest feasibly parallelizable class of languages. By feasible
we mean that the number of gates (equivalently processors) is no more than polynomial
in the length n of the input and by parallelizable we mean that circuit depth (equivalently 
computation time) must be no more than poly-logarithmic in n. Feasibly parallelizable lan- 
guages meet both requirements. 

The prefix circuits introduced in Section 2.6 belong to $\mathrm{NC}^1$, as do circuits constructed
with prefix operations, such as binary addition and subtraction (see Section 2.7) and the circuits
for solutions of linear recurrences (see Problem 2.24). (Strictly speaking, these functions
are not predicates and do not define languages. However, comparisons between their values
and a threshold convert them to predicates. In this section we liberally mix functions and
predicates.) The class $\mathrm{NC}^1$ also contains functions associated with integer multiplication and
division.

The fast Fourier transform (see Section 6.7.3) and merging networks (see Section 6.8) can
both be realized by algebraic and combinatorial circuits of depth $O(\log n)$, where n is the
number of circuit inputs. If the additions and multiplications of the FFT are done over a ring
of integers modulo m for some m, the FFT can be realized by a circuit of depth $O(\log^2 n)$. If
the items to be merged are represented in binary, a comparison operator can be realized with
depth $O(\log n)$ and merging can also be done with a circuit of depth $O(\log^2 n)$. Thus, both
problems are in $\mathrm{NC}^2$.

When matrices are defined over a field of characteristic zero, the inverse of invertible matrices
(see Section 6.5.5) can be computed by an algebraic circuit of depth $O(\log^2 n)$. If the
matrix entries when represented as binary numbers have size n, the ring operations may be
realized in terms of binary addition and multiplication, and matrix inversion is in $\mathrm{NC}^3$.

Also, it follows from Theorem 8.13.3 that the nth circuit in the log-space uniform families
of circuits has polynomial size and depth $O(\log^2 n)$; that is, it is contained in $\mathrm{NC}^2$. Also
contained in this set is the transitive closure of a Boolean matrix (see Section 6.4). Since the
circuits constructed in Chapter 3 to simulate finite-state machines as well as polynomial-time
Turing machines are log-space uniform (see Theorem 8.13.1), each of these circuit families is
in $\mathrm{NC}^2$.

We now relate these complexity classes to one another and to P. 

THEOREM 8.15.1 For $k \geq 2$, $\mathrm{NC}^1 \subseteq \mathrm{L} \subseteq \mathrm{NL} \subseteq \mathrm{NC}^2 \subseteq \mathrm{NC}^k \subseteq \mathrm{NC} \subseteq \mathrm{P}$.

Proof The containment $\mathrm{L} \subseteq \mathrm{NL}$ is obvious. The containment $\mathrm{NL} \subseteq \mathrm{NC}^2$ is a restriction
of the result of Theorem 8.13.3 to $r(n) = O(\log n)$. The containments $\mathrm{NC}^2 \subseteq \mathrm{NC}^k \subseteq
\mathrm{NC}$ follow from the definitions. The last containment, $\mathrm{NC} \subseteq \mathrm{P}$, is a consequence of the
fact that the circuit on n inputs in a log-space uniform family of circuits, call it $C_n$, can
be generated in polynomial time by a Turing machine that can then evaluate $C_n$ in a time
quadratic in its length, that is, in polynomial time. (Theorems 8.5.8 and 8.13.2 apply.)

The first containment, namely $\mathrm{NC}^1 \subseteq \mathrm{L}$, is slightly more difficult to establish. Given a
language $L \in \mathrm{NC}^1$, consider the problem of recognizing whether or not a string w is in L.
This recognition task is done in log-space by invoking two log-space transformations, as is
now explained.

The first log-space transformation generates the nth circuit, $C_n$, in the family recognizing
L. $C_n$ has value 1 if w is in L and 0 otherwise. By definition, $C_n$ has size polynomial
in n. Also, each circuit is described by a straight-line program, as explained in Section 2.2.

The second log-space transformation evaluates the circuit with temporary work space
proportional to the maximal length of such strings. If the strings identifying gates have
larger length, their transformation would use more space. (Note that it is easy to identify
gates with strings of length $O(\log^2 n)$ by concatenating the number of each gate on the
path to it, including itself.) For this reason we give an efficient encoding of gate locations.

The gates of circuits in $\mathrm{NC}^1$ generally have fan-out exceeding 1. That is, they have more
than one parent gate in the circuit. We describe how to identify gates with strings that may
associate multiple strings with a gate. We walk the graph, which is the circuit, starting from
the output vertex and moving toward input vertices. The output gate is identified with the
empty string $\epsilon$. If we reach a gate g via a parent whose string is p, g is identified by
p0 or p1. If the parent has only one descendant, as would be the case for NOT gates and
inputs, we represent g by p0. If it has two descendants, as would be the case for AND and
OR, and g has the smaller gate number, its string is p0; otherwise it is p1.

The algorithm to produce each of these binary strings can be executed in logarithmic 
space because one need only walk each path in the circuit from the output to inputs. The 
tuple defining each gate contains the gate numbers of its predecessors, $O(\log n)$-length
numbers, and the algorithm need only carry one such number at a time in its working mem- 
ory to find the location of a predecessor gate in the input string containing the description 
of the circuit. 

The second log-space transformation evaluates the circuit using the binary strings de- 
scribing the circuit. It visits the input vertex with the lexicographically smallest string and 
determines its value. It then evaluates the gate whose string is that of the input vertex minus 
the last bit. Even though it may have to revisit all gates on the path to this vertex to do this, 
$O(\log n)$ space is used. If this gate is either a) AND and the input has value 0, b) OR and
the input has value 1, or c) NOT, the value of the gate is decided. If the gate has more than
one input and its value is not decided, the other input to it is evaluated (the one with suffix
1). Because the second input to the gate is evaluated only if needed, its value determines
the value of the gate. This process is repeated at each gate in the circuit until the output
gate is reached and its value computed. Since this procedure keeps only one path of length
$O(\log n)$ active at a time, the algorithm uses space $O(\log n)$. ■
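
The evaluation order described in this proof can be sketched in Python. The sketch is an illustration under assumptions made here: gates are (operation, arguments) tuples, the function names are hypothetical, and recursion stands in for the walk from the output, so the recursion stack plays the role of the single active path. A genuine log-space machine would store only the binary path string and re-walk the circuit description from the output gate when it needs to back up.

```python
def evaluate(circuit, inputs, g, path="", visited=None):
    """Evaluate gate g; `path` is the binary string identifying this visit.

    circuit[g] = (op, args) with op in {"IN", "NOT", "AND", "OR"}; for "IN",
    args[0] indexes `inputs`, otherwise args are gate numbers.
    """
    if visited is not None:
        visited.append(path)          # record the string naming this visit
    op, args = circuit[g]
    if op == "IN":
        return inputs[args[0]]
    if op == "NOT":
        return 1 - evaluate(circuit, inputs, args[0], path + "0", visited)
    lo, hi = sorted(args)             # smaller gate number gets suffix 0
    first = evaluate(circuit, inputs, lo, path + "0", visited)
    # An AND with a 0 input or an OR with a 1 input is already decided.
    if (op == "AND" and first == 0) or (op == "OR" and first == 1):
        return first
    # Otherwise the second input alone determines the value of the gate.
    return evaluate(circuit, inputs, hi, path + "1", visited)


# XOR of two inputs; gates 0 and 1 have fan-out 2, so each receives two strings.
XOR_CIRCUIT = [("IN", (0,)), ("IN", (1,)), ("NOT", (0,)), ("NOT", (1,)),
               ("AND", (0, 3)), ("AND", (1, 2)), ("OR", (4, 5))]
```

The length of `path` never exceeds the circuit depth, which is the quantity that bounds the work space in the proof.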

An important open question is whether the complexity hierarchy of this theorem collapses 
and, if so, where. For example, is it true that a problem in P is also in NC? If so, all serial 
polynomial-time problems are parallelizable with a number of processing elements polynomial 
in the length of the input and poly-logarithmic time, an unlikely prospect. 

8.15.2 Circuits of Polynomial Size 

We now examine the class of languages P/poly and show that they are exactly the languages 
recognized by Boolean circuits of polynomial size. To set the stage we introduce advice and 
pairing functions. 

DEFINITION 8.15.2 An advice function $a : \mathbb{N} \mapsto \mathcal{B}^*$ maps natural numbers to binary strings.
A polynomial advice function is an advice function for which $|a(n)| \leq p(n)$ for p(n) a
polynomial function in n.

DEFINITION 8.15.3 A pairing function <, > : $\mathcal{B}^* \times \mathcal{B}^* \mapsto \mathcal{B}^*$ encodes pairs of binary strings
x and y with two end markers and a separator (a comma) into the binary string <x, y>.

Pairing functions can be very easy to describe and compute. For example, <x, y> can
be implemented by representing 0 by 01, 1 by 10, both < and > by 11, and , (comma) by 00.
Thus, <0010, 110> is encoded as 11010110010010100111. It is clearly trivial to identify,
extract, and decode each component of the pair. We are now prepared to define P/poly.
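
The encoding just described takes only a few lines of code. The following Python sketch is an illustration; the names `pair` and `unpair` are chosen here and are not part of the text.

```python
CODE = {"0": "01", "1": "10", ",": "00"}   # two-bit codes for 0, 1, and the comma
DECODE = {"01": "0", "10": "1"}
END = "11"                                  # code shared by the markers < and >


def pair(x: str, y: str) -> str:
    """Encode the pair of binary strings x and y as one binary string."""
    return END + "".join(CODE[ch] for ch in x + "," + y) + END


def unpair(s: str):
    """Recover (x, y) from a string produced by pair()."""
    assert len(s) % 2 == 0 and s[:2] == END and s[-2:] == END
    # Split the interior into two-bit symbols and locate the comma.
    symbols = [s[i:i + 2] for i in range(2, len(s) - 2, 2)]
    sep = symbols.index("00")
    x = "".join(DECODE[t] for t in symbols[:sep])
    y = "".join(DECODE[t] for t in symbols[sep + 1:])
    return x, y


print(pair("0010", "110"))  # 11010110010010100111
```

The encoded string has length 2(|x| + |y|) + 6, so pairing a string with polynomial-length advice keeps the total length polynomial.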

DEFINITION 8.15.4 Let $a : \mathbb{N} \mapsto \mathcal{B}^*$ be a polynomial advice function. P/poly is the set of
languages $L = \{w \mid \langle w, a(|w|) \rangle \in A\}$ for which there is a language A in P.

The advice a(|w|) given on a string w in a language $L \in$ P/poly is the same for all
strings of the same length. Furthermore, <w, a(|w|)> must be easy to recognize, namely,
recognizable in polynomial time.

The subset of the languages in P/poly for which the advice function is the empty string is
exactly the languages in P, that is, $\mathrm{P} \subseteq$ P/poly.

The following result is the principal result of this section. It gives two different interpreta- 
tions of the advice given on strings. 

THEOREM 8.15.2 A language L is recognizable by a family of circuits of polynomial size if and
only if $L \in$ P/poly.

Proof Let L be recognizable by a family C of circuits of polynomial size. We show that it is 
in P/poly. 

Let the advice function a(n) be an encoding of the circuit $C_n$ in the family that recognizes
strings in $L \cap \mathcal{B}^n$, and let $w \in \mathcal{B}^*$ have length n. Then $w \in L$ if and only if
the value of $C_n$ on w is 1. Since a(n) has length polynomial in n, $w \in L$ if and only if the
pair <w, a(|w|)> is a "yes" instance of CIRCUIT VALUE, which has been shown to
be in P. (See Theorem 8.13.2.)

On the other hand, suppose that $L \in$ P/poly. We show that L is recognizable by circuits
of polynomial size. By definition there is an advice function $a : \mathbb{N} \mapsto \mathcal{B}^*$ and a language
$A \in \mathrm{P}$ for L such that for all $w \in L$, $\langle w, a(|w|) \rangle \in A$. Since $A \in \mathrm{P}$, there is a
polynomial-time DTM that accepts <w, a(|w|)>. By Theorem 8.13.1 there is a circuit
of polynomial size that recognizes <w, a(|w|)>. The string a(|w|) is constant for strings
w of length n. Thus, the circuit for $A \cap \mathcal{B}^n$ to which is supplied the constant string a(|w|)
is a circuit of size polynomial in n that accepts strings w in L. ■



Problems 

MATHEMATICAL PRELIMINARIES 

8.1 Show that if strings over an alphabet A with at least two letters are encoded over a 
one-letter alphabet (a unary encoding), then strings of length n over A require strings 
of length exponential in n in the unary encoding. 

8.2 Show that the polynomial function $p(n) = K_1 n^k$ can be computed in $O(\log n)$ serial
steps from n and for constants $K_1 \geq 1$ and $k \geq 1$ on a RAM when additions require
one unit of time. 

SERIAL COMPUTATIONAL MODELS

8.3 Given an instance of satisfiability, namely, a set of clauses over a set of literals and values 
for the variables, show that the clauses can be evaluated in time quadratic in the length 
of the instance. 

8.4 Consider the RAM of Section 8.4.1. Let l(I) be the length, measured in bits, of the
contents I of the RAM's input registers. Similarly, let l(v) be the maximal length of any
integer addressed by an instruction in the RAM's program. Show that after k steps the
length of the contents of any RAM memory location is at most k + l(I) + l(v).

Give an example of a computation that produces a word of length k.

Hint: Consider which instructions have the effect of increasing the length of an integer 

used or produced by the RAM program. 

8.5 Consider the RAM of Section 8.4.1. Assume the RAM executes T steps. Describe a 
Turing-machine simulation of this RAM that uses space proportional to T measured 
in bits. 

Hint: Represent each RAM memory location visited during a computation by an 
(address, contents) pair. When a RAM location is updated, fill the cells on the 
second tape containing the old (address, contents) pair with a special "blank" char- 
acter and add the new (address, contents) pair to the end of the list of such pairs. 
Use the results of Problem 8.4 to bound the length of individual words. 

8.6 Consider the RAM of Section 8.4.1. Using the result of Problem 8.5, describe a multitape
Turing machine that simulates in $O(T^3)$ steps a T-step computation by the RAM.
Hint: Let your machine have seven tapes: one to hold the input, a second to hold 
the contents of RAM memory recorded as (address, contents) pairs separated and 
terminated by appropriate markers, a third to hold the current value of the program 
counter, a fourth to hold the memory address being sought, and three tapes for operands 
and results. On the input tape place the program to be executed and the input on which 
it is to be executed. Handle the second tape as suggested in Problem 8.5. When per- 
forming an operation that has two operands, place them on the fifth and sixth tapes 
and the result on the seventh tape. 

8.7 Justify using the number of tape cells as a measure of space for the Turing machine 
when the more concrete measure of bits is used for the space measure for the RAM. 

CLASSIFICATION OF DECISION PROBLEMS 

8.8 Given a Turing machine, deterministic or not, show that there exists another Turing 
machine with a larger tape alphabet that performs the same computation but in a num- 
ber of steps and number of tape cells that are smaller by constant factors. 

8.9 Show that strings in TRAVELING SALESPERSON can be accepted by a deterministic 
Turing machine in an exponential number of steps. 

COMPLEMENTS OF COMPLEXITY CLASSES 

8.10 Show that VALIDITY is log-space complete for coNP. 

8.11 Prove that the complements of NP-complete problems are coNP-complete.

8.12 Show that the complexity class P is contained in the intersection of NP and coNP.

8.13 Demonstrate that $\mathrm{coNP} \subseteq \mathrm{PSPACE}$.

8.14 Prove that if a coNP-complete problem is in NP, then NP = coNP. 

REDUCTIONS 

8.15 If $\mathcal{P}_1$ and $\mathcal{P}_2$ are decision problems, a Turing reduction from $\mathcal{P}_1$ to $\mathcal{P}_2$ is any OTM
that solves $\mathcal{P}_1$ given an oracle for $\mathcal{P}_2$. Show that the reductions of Section 2.4 are
Turing reductions.

8.16 Prove that the reduction given in Section 10.9.1 of a pebble game to a branching com- 
putation is a Turing reduction. (See Problem 8.15.) 

8.17 Show that if a problem $\mathcal{P}_1$ can be Turing-reduced to problem $\mathcal{P}_2$ by a polynomial-time
OTM and $\mathcal{P}_2$ is in P, then $\mathcal{P}_1$ is also in P.

Hint: Since each invocation of the oracle can be done deterministically in polynomial 
time in the length of the string written on the oracle tape, show that it can be done in 
time polynomial in the length of the input to the OTM. 

8.18 a) Show that every fixed power of an integer written as a binary k-tuple can be
computed by a DTM in space O(k).
b) Show that every fixed polynomial in an integer written as a binary k-tuple can be
computed by a DTM in space O(k).

Hint: Show that carry-save addition can be used to multiply two k-bit integers with
work space O(k).

HARD AND COMPLETE PROBLEMS 

8.19 The polynomial-time Turing reductions are the Turing reductions in which the
OTM runs in time polynomial in the length of its input. Show that the class of Turing
reductions is transitive.

P-COMPLETE PROBLEMS 

8.20 Show that numbers can be assigned to gates in an instance of MONOTONE CIRCUIT 
VALUE that corresponds to an instance of CIRCUIT VALUE in Theorem 8.9.1 so that 
the reduction from it to MONOTONE CIRCUIT VALUE can be done in logarithmic 
space. 

8.21 Prove that LINEAR PROGRAMMING described below is P-complete. 

LINEAR PROGRAMMING 

Instance: Integer-valued $m \times n$ matrix A and column m-vectors b and c.

Answer: "Yes" if there is a rational column n-vector $x \geq 0$ such that $Ax \leq b$ and x
maximizes $c^T x$.

NP-COMPLETE PROBLEMS 

8.22 A Horn clause has at most one positive literal (an instance of $x_i$). Every other literal
in a Horn clause is a negative literal (an instance of $\bar{x}_j$). HORN SATISFIABILITY is an
instance of SATISFIABILITY in which each clause is a Horn clause. Show that HORN
SATISFIABILITY is in P.

Hint: If all literals in a clause are negative, the clause is satisfied only if some associated
variables have value 0. If a clause has one positive literal, say y, and negative literals, say
$\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_k$, then the clause is satisfied if and only if the implication
$x_1 \wedge x_2 \wedge \cdots \wedge x_k \Rightarrow y$ is true. Thus, y has value 1 when each of these variables has value 1. Let T
be a set of variables that must have value 1. Let T contain initially all positive literals that
appear alone in a clause. Cycle through all implications and for each implication all
of whose left-hand side variables appear in T but whose right-hand side variable does
not, add this variable to T. Since T grows until all left-hand sides are satisfied, this
procedure terminates. Show that all satisfying assignments contain T.
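
The procedure in the hint can be sketched in Python. This is an illustration under assumptions made here, not a solution to the problem: a clause is represented as a pair (negatives, positive) in which `negatives` is a set of variable names and `positive` is a variable name or `None`, and the function name is hypothetical. The final check on the all-negative clauses is the step the hint leaves to the reader.

```python
def horn_sat(clauses):
    """Return the set T of variables forced to 1, or None if unsatisfiable.

    Each clause is (negatives, positive): the negated variables as a set,
    and the single positive variable or None when the clause has none.
    """
    true_vars = set()
    changed = True
    while changed:                      # T can grow at most once per variable
        changed = False
        for negatives, positive in clauses:
            # The implication fires when its whole left-hand side is in T.
            if (positive is not None and positive not in true_vars
                    and negatives <= true_vars):
                true_vars.add(positive)
                changed = True
    for negatives, positive in clauses:
        # An all-negative clause needs some variable at 0; it fails when
        # every one of its variables has been forced to 1.
        if positive is None and negatives <= true_vars:
            return None
    return true_vars
```

Setting the variables in T to 1 and all others to 0 satisfies every clause whenever the function returns a set, and each pass over the clauses either adds a variable or ends the loop, so the running time is polynomial in the instance length.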

8.23 Describe a polynomial-time algorithm to determine whether an instance of CIRCUIT 
SAT is a "yes" instance when the circuit in question consists of a layer of AND gates 
followed by a layer of OR gates. Inputs are connected to AND gates and the output gate 
is an OR gate. 

8.24 Prove that the CLIQUE problem defined below is NP-complete. 

CLIQUE 

Instance: The description of an undirected graph G = (V, E) and an integer k. 

Answer: "Yes" if there is a set of k vertices of G such that all vertices are adjacent. 

8.25 Prove that the HALF CLIQUE problem defined below is NP-complete. 

HALF CLIQUE 

Instance: The description of an undirected graph G = (V, E) in which |V| is even and
an integer k.

Answer: "Yes" if G contains a clique on |V|/2 vertices or has more than k edges.

Hint: Try reducing an instance of CLIQUE on a graph with m vertices and a clique of
size k to this problem by expanding the number of vertices and edges to create a graph
that has |V| > m vertices and a clique of size |V|/2. Show that a test for the condition
that G contains more than k edges can be done very efficiently by counting the number 
of bits among the variables describing edges. 

8.26 Show that the NODE COVER problem defined below is NP-complete. 

NODE COVER 

Instance: The description of an undirected graph G = (V, E) and an integer k.
Answer: "Yes" if there is a set of k vertices such that every edge contains at least one of 
these vertices. 

8.27 Prove that the HAMILTONIAN PATH decision problem defined below is NP-complete. 

HAMILTONIAN PATH 

Instance: The description of an undirected graph G. 

Answer: "Yes" if there is a path visiting each node once. 

Hint: 3-SAT can be reduced to HAMILTONIAN PATH, but the construction is challenging.
First, add literals to clauses in an instance of 3-SAT so that each clause has
three literals. Second, construct and interconnect three types of subgraphs (gadgets).
Figures 8.23(a) and (b) show the first and second of these gadgets, $G_1$ and $G_2$.

Figure 8.23 Gadgets used to reduce 3-SAT to HAMILTONIAN PATH.

There is one first gadget for each variable $x_i$, $1 \leq i \leq n$, denoted $G_{1,i}$. The left path
between the two middle vertices in $G_{1,i}$ is associated with the value $x_i = 1$ and the
right path is associated with the complementary value, $x_i = 0$. Vertex f of $G_{1,i}$ is
identified with vertex e of $G_{1,i+1}$ for $1 \leq i \leq n-1$, vertex e of $G_{1,1}$ is connected only
to a vertex in $G_{1,1}$, and vertex f of $G_{1,n}$ is connected to the clique described below.
There is one second gadget for each literal in each clause. Thus, if $x_i$ ($\bar{x}_i$) is a literal in
clause $C_j$, then we create a gadget $G_{2,j,i,1}$ ($G_{2,j,i,0}$).

Since a HAMILTONIAN PATH touches every vertex, a path through $G_{2,j,i,v}$ for $v \in
\{0, 1\}$ passes either from a to c or from b to d.

For each $1 \leq i \leq n$ the two parallel edges of $G_{1,i}$ are broken open and two vertices
appear in each of them. For each instance of the literal $x_i$ ($\bar{x}_i$), connect the vertices a
and c of $G_{2,j,i,1}$ ($G_{2,j,i,0}$) to the pair of vertices on the left (right) that are created in
$G_{1,i}$. Connect the b vertex of one literal in clause $C_j$ to the d vertex of another one, as
suggested in Fig. 8.23(c).

The third gadget has vertices g and h and a connecting edge. One of these two vertices,
h, is connected in a clique with the b and d vertices of the gadgets $G_{2,j,i,v}$ and the f
vertex of $G_{1,n}$.

This graph has a Hamiltonian path between g and the e vertex of $G_{1,1}$ if and only if
the instance of 3-SAT is a "yes" instance.

8.28 Show that the TRAVELING SALESPERSON decision problem defined below is NP-complete.

TRAVELING SALESPERSON

Instance: An integer k and a set of $n(n-1)/2$ distances
$\{d_{1,2}, d_{1,3}, \ldots, d_{1,n}, d_{2,3}, \ldots, d_{2,n}, \ldots, d_{n-1,n}\}$ between n cities.

Answer: "Yes" if there is a tour (an ordering) $\{i_1, i_2, \ldots, i_n\}$ of the cities such that the
length $l = d_{i_1,i_2} + d_{i_2,i_3} + \cdots + d_{i_n,i_1}$ of the tour satisfies $l \leq k$.



Hint: Try reducing HAMILTONIAN PATH to TRAVELING SALESPERSON. 

8.29 Give a proof that the PARTITION problem defined below is NP-complete.

PARTITION 

Instance: A set $Q = \{a_1, a_2, \ldots, a_n\}$ of positive integers.

Answer: "Yes" if there is a subset of Q that adds to $\frac{1}{2} \sum_{1 \leq i \leq n} a_i$.

PSPACE-COMPLETE PROBLEMS 

8.30 Show that the procedure tree_eval described in the proof of Theorem 8.12.1 can 
be modified slightly to apply to the evaluation of the trees generated in the proof of 
Theorem 8.12.3. 

Hint: A vertex of in-degree k can be replaced by a binary tree with k leaves and depth
$\log_2 k$.

THE CIRCUIT MODEL OF COMPUTATION 

8.31 Prove that the class of circuits described in Section 3.1 that simulate a finite-state ma- 
chine are uniform. 

8.32 Generalize Theorems 8.13.1 and 8.13.2 to uniform circuit families and Turing ma- 
chines that use more than logarithmic space and polynomial time, respectively. 

8.33 Write an $O(\log n)$-space program based on the one in Fig. 8.20 to describe a circuit for
the transitive closure of an n x n matrix based on matrix squaring. 

THE PARALLEL RANDOM-ACCESS MACHINE MODEL 

8.34 Complete the proof of Lemma 8.14.2 by making specific assignments of data to mem- 
ory locations. Also, provide formulas for the assignment of processors to tasks. 

8.35 Give a definition of a log-space uniform family of PRAMs for which Lemma 8.14.1
can be extended to show that the function $f : \mathcal{B}^* \mapsto \mathcal{B}^*$ computed by a log-space
family of PRAMs can also be computed by a log-space uniform family of circuits satisfying
the conditions of Lemma 8.14.1.

8.36 Exhibit a non-uniform family of PRAMs that can solve problems that are not recur- 
sively enumerable. 

8.37 Lemma 8.14.1 is stated for PRAMs in which the CPU does not access a common memory
address larger than $O(p(n)t(n))$. In particular, this model does not permit indirect
addressing. Show that this lemma can be extended to RAM CPUs that do allow
indirect addressing by using the representation for memory accesses in Problem 8.6.



Chapter Notes 

The classification of languages by the resources needed for their recognition is a very large 
subject capable of book-length study. The reader interested in going beyond the introduction 
given here is advised to consult one of the readily available references. The Handbook of 
Theoretical Computer Science contains three survey articles on this subject by van Emde Boas 
[350], Johnson [151], and Karp and Ramachandran [161]. 
The first examines simulation of one computational model by another for a large range of 
models. The second provides a large catalog of complexity classes and relationships between 
them. The third examines parallel algorithms and complexity. Other sources for more infor- 
mation on this topic are the books by Hopcroft and Ullman [141], Lewis and Papadimitriou 
[200], Balcazar, Diaz, and Gabarro on structural complexity [27], Garey and Johnson [109] 
on the theory of NP-completeness, Greenlaw, Hoover, and Ruzzo [120] on P-completeness, 
and Papadimitriou [235] on computational complexity. 

The Turing machine was defined by Alan Turing in 1936 [338], as was the oracle Turing 
machine. Random-access machines were introduced by Shepherdson and Sturgis [308] and 
the performance of RAMs was analyzed by Cook and Reckhow [77] . 

Hartmanis, Lewis, and Stearns [127,128] gave the study of time and space complexity 
classes its impetus. Their papers contain many of the basic theorems on complexity classes, 
including the space and time hierarchy theorems stated in Section 8.5.1. The gap theorem 
was obtained by Trakhtenbrot [334] and rediscovered by Borodin [51]. Blum [46] developed 
machine-independent complexity measures and established a speedup theorem showing that 
for some languages there is no single fastest recognition algorithm [47] . 

Many individuals identified and recognized the importance of the classes P and NP. Cook 
[74] formalized NP, emphasized the importance of polynomial-time reducibility, and exhib- 
ited the first NP-complete problem, SATISFIABILITY. Karp [159] then demonstrated that 
a number of other combinatorial problems, including TRAVELING SALESPERSON, are NP- 
complete. Cook used Turing reductions in his classification whereas Karp used polynomial- 
time transformations. Independently and almost simultaneously Levin [199] (see also [335]) 
was led to concepts similar to the above. 

The relationship between nondeterministic and deterministic space (Theorem 8.5.5 and 
Corollary 8.5.1) was established by Savitch [297]. The proof that nondeterministic space 
classes are closed under complementation (Theorem 8.6.2 and Corollary 8.6.2) is independently 
due to Szelepcsényi [322] and Immerman [145]. 

Theorem 8.6.4, showing that PRIMALITY is in NP ∩ coNP, is due to Pratt [257]. 

Cook [75] defined the concept of a P-complete problem and exhibited the first such prob- 
lem. He was followed quickly by Jones and Laaser [153] and Galil [108]. Ladner [185] showed 
that circuits simulating Turing machines (see [286]) could be constructed in logarithmic space, 
thereby establishing that CIRCUIT VALUE is P-complete. Goldschlager [117] demonstrated 
that MONOTONE CIRCUIT VALUE is P-complete. Valiant [345] and Cook established that 
LINEAR INEQUALITIES is P-hard, and Khachian [165] showed that this problem is in P. The 
proof that DTM ACCEPTANCE is P-complete is due to Johnson [151]. 

Cook [74] gave the first proof that SATISFIABILITY is NP-complete and also gave the 
reduction to 3-SAT. Independently, Levin [199] (see also [335]) was led to similar concepts 
for combinatorial problems. Schafer [299] showed that NAESAT is NP-complete. Karp [159] 
established that 0-1 INTEGER PROGRAMMING, 3-COLORING, EXACT COVER, SUBSET 
SUM, TASK SEQUENCING, and INDEPENDENT SET are NP-complete. 

The proof that 2-SAT is in NL (Theorem 8.11.1) is found in Papadimitriou [235]. 

Karp [159] exhibited a PSPACE-complete problem, Meyer and Stockmeyer [316] demon- 
strated that QUANTIFIED SATISFIABILITY is PSPACE-complete and Schafer established that 
GENERALIZED GEOGRAPHY is PSPACE-complete [299]. 

The notion of a uniform circuit was introduced by Borodin [52] and has been examined by 
many others. (See [120].) Borodin [52] established the connection between nondeterministic 
space and circuit depth stated in Theorem 8.13.3. Stockmeyer and Vishkin [317] show how 
to simulate efficiently the PRAM with circuits and vice versa. (See also [161].) The class NC 
was defined by Cook [76]. Theorem 8.15.2 is due to Pippenger [249]. The class P/poly and 
Theorem 8.15.2 are due to Karp and Lipton [160]. 

A large variety of parallel computational models have been developed. (See van Emde 
Boas [350] and Greenlaw, Hoover, and Ruzzo [120].) The PRAM was introduced by Fortune 
and Wyllie [103] and Goldschlager [118,119]. 

Several problems on the efficient simulation of RAMs are from Papadimitriou [235]. 




CHAPTER 9 



Circuit Complexity 



The circuit complexity of a binary function is measured by the size or depth of the smallest 
or shallowest circuit for it. Circuit complexity derives its importance from the corollary to 
Theorem 3.9.2; namely, if a function has a large circuit size over a complete basis of fixed 
fan-in, then the time on a Turing machine required to compute it is large. The importance of 
this observation is illustrated by the following fact. For $n \geq 1$, let $f_L^{(n)}$ be the characteristic 
function of an NP-complete language L, where $f_L^{(n)}$ has value 1 on strings of length n in L 
and value 0 otherwise. If $f_L^{(n)}$ has super-polynomial circuit size for all sufficiently large n, then 
$P \neq NP$. 

In this chapter we introduce methods for deriving lower bounds on circuit size and depth. 
Unfortunately, it is generally much more difficult to derive good lower bounds on circuit 
complexity than good upper bounds; an upper bound measures the size or depth of a particular 
circuit whereas a lower bound must rule out a smaller size or depth for all circuits. As a 
consequence, the lower bounds derived for functions realized by circuits over complete bases 
of bounded fan-in are often weak. 

In attempting to understand lower bounds for complete bases, researchers have studied 
monotone circuits over the monotone basis and bounded-depth circuits over the basis {AND, 
OR, NOT} in which the first two gates are allowed to have unbounded fan-in. Formula size, 
which is approximately the size of the smallest circuit with fan-out 1, has also been studied. 
Lower bounds to formula size also produce lower bounds to circuit depth, a measure of the 
parallel time needed for a function. 

Research on these restricted circuit models has led to some impressive results. Exponential 
lower bounds on circuit size have been derived for monotone functions over the monotone 
basis and functions such as parity when realized by bounded-depth circuits. Unfortunately, 
the methods used to obtain these results may not apply to complete bases of bounded fan-in. 
Fortunately, it has been shown that the slice functions have about the same circuit size over 
both the monotone and standard (non-monotone) bases. This may help resolve the $P \stackrel{?}{=} NP$ 
question, since there are NP-complete slice problems. 

Despite the difficulty of deriving lower bounds, circuit complexity continues to offer one 
of the methods of highest potential for distinguishing between P and NP. 






392 Chapter 9 Circuit Complexity Models of Computation 

9.1 Circuit Models and Measures 

In this section we characterize types of logic circuits by their bases and the fan-in and fan- 
out of basis elements. We consider bases that are complete and incomplete and that have 
bounded and unbounded fan-in. We also consider circuits in which the fan-out is restricted 
and unrestricted. Each of these factors can affect the size and depth of a circuit. 

9.1.1 Circuit Models 

The (general) logic circuit is the graph of a straight-line program in which the variables have 
value 0 or 1 and the operations are Boolean functions $g : \mathcal{B}^p \mapsto \mathcal{B}$, $p \geq 1$. (Boolean functions 
have one binary output value. Logic circuits are defined in Section 1.2 and discussed at length in 
Chapter 2.) The vertices in a logic circuit are labeled with Boolean operations and are called 
gates; the set of different gate types used in a circuit is called the basis (denoted $\Omega$) for the 
circuit. The fan-in of a basis is the maximal fan-in of any function in the basis. A circuit 
computes the binary function $f : \mathcal{B}^n \mapsto \mathcal{B}^m$, which is the mapping from the n circuit inputs 
to the m gate outputs designated as circuit outputs. 

The standard basis, denoted $\Omega_0$, is the set {AND, OR, NOT} in which AND and OR have 
fan-in 2. The full two-input basis, denoted $B_2$, consists of all two-input Boolean functions. 
The dyadic unate basis, denoted $U_2$, consists of all Boolean functions of the form $(x^a \wedge y^b)^c$ 
for constants a, b, c in $\mathcal{B}$. Here $x^1 = x$ and $x^0 = \overline{x}$. 

A basis $\Omega$ is complete if every binary function can be computed by a circuit over $\Omega$. The 
bases $\Omega_0$, $B_2$, and $U_2$ are complete, as is the basis consisting of the NAND gate computing the 
function $x \text{ NAND } y = \overline{x \wedge y}$. (See Problem 2.5.) 

The bounded fan-out circuit model specifies a bound on the fan-out of a circuit. As we 
shall see, the fan-out-1 circuit plays a special role related to circuit depth. Each circuit of 
fan-out 1 corresponds to a formula in which the operators are the functions associated with 
vertices of the circuit. Figure 9.1 shows an example of a circuit of fan-out 1 over the standard 
basis and its associated formula. (See also Problem 9.9.) Although each input variable appears 
once in this example, Boolean functions generally require multiple instances of variables (have 
fan-out greater than 1). Formula size is studied at length in Section 9.4. 

To define the monotone circuits, we need an ordering of binary n-tuples. Two such tuples, 
$x = (x_1, x_2, \ldots, x_n)$ and $y = (y_1, y_2, \ldots, y_n)$, are in the relation $x \leq y$ if for all $1 \leq i \leq n$, 
$x_i \leq y_i$, where $0 \leq 0$, $1 \leq 1$, and $0 \leq 1$, but $1 \not\leq 0$. (Thus, $001011 \leq 101111$, but 
$011011 \not\leq 101111$.) 

A monotone circuit is a circuit over the monotone basis $\Omega_{mon}$ = {AND, OR} in which 
the fan-in is 2. There is a direct correspondence between monotone circuits and monotone 
functions. A monotone function is a function $f : \mathcal{B}^n \mapsto \mathcal{B}^m$ that is either monotone 
increasing, that is, for all $x, y \in \mathcal{B}^n$, if $x \leq y$, then $f(x) \leq f(y)$, or is monotone 
decreasing, that is, for all $x, y \in \mathcal{B}^n$, if $x \leq y$, then $f(x) \geq f(y)$. Unless stated otherwise, a 
monotone function will be understood to be a monotone increasing function. 

A monotone Boolean function has the following expansion on the first variable, as the 
reader can show. (See Problem 9.10.) A similar expansion is possible on any variable. 

$f(x_1, x_2, \ldots, x_n) = f(0, x_2, \ldots, x_n) \vee (x_1 \wedge f(1, x_2, \ldots, x_n))$ 

$y = ((((x_7 \vee x_6) \wedge (x_5 \vee x_4)) \vee x_3) \wedge (x_2 \wedge x_1))$ 

Figure 9.1 A circuit of fan-out 1 over a basis with fan-in 2 and a corresponding formula. The 
value y at the root is the AND of the value $(((x_7 \vee x_6) \wedge (x_5 \vee x_4)) \vee x_3)$ of the left subtree with 
the value $(x_2 \wedge x_1)$ of the right subtree. 

By applying this expansion to every variable in succession, we see that each monotone function 
can be realized by a circuit over the monotone basis. Furthermore, the monotone basis $\Omega_{mon}$ 
is complete for the monotone functions, that is, every monotone function can be computed 
by a circuit over the basis $\Omega_{mon}$. (See Problem 2.) 
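
The expansion on the first variable can be checked by brute force for small n. The sketch below is illustrative only: it represents a function on n = 3 variables as a truth table, enumerates all tables that are monotone increasing under the componentwise order defined above, and confirms the expansion for each. The helper names are assumptions made for this example.

```python
from itertools import product

def leq(x, y):
    """Componentwise order on binary tuples: x <= y iff x_i <= y_i for all i."""
    return all(a <= b for a, b in zip(x, y))

def is_monotone(table, n):
    """table maps each n-tuple of bits to 0 or 1; test monotone increasing."""
    pts = list(product((0, 1), repeat=n))
    return all(table[x] <= table[y] for x in pts for y in pts if leq(x, y))

def expansion_holds(table, n):
    """Check f(x) == f(0, rest) OR (x_1 AND f(1, rest)) for every input x."""
    return all(
        table[x] == (table[(0,) + x[1:]] | (x[0] & table[(1,) + x[1:]]))
        for x in product((0, 1), repeat=n)
    )

def monotone_functions(n):
    """Yield the truth table of every monotone increasing function on n variables."""
    pts = list(product((0, 1), repeat=n))
    for values in product((0, 1), repeat=len(pts)):
        table = dict(zip(pts, values))
        if is_monotone(table, n):
            yield table

monotone = list(monotone_functions(3))
print(len(monotone), all(expansion_holds(t, 3) for t in monotone))

# A non-monotone function such as parity does not satisfy the expansion.
parity = {x: sum(x) % 2 for x in product((0, 1), repeat=3)}
print(expansion_holds(parity, 3))
```

The expansion holds for a monotone function because $f(0, x_2, \ldots, x_n) \leq f(1, x_2, \ldots, x_n)$; parity violates this inequality, which is why the check fails for it.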

In Section 9.6 we show that some monotone functions on n variables require monotone 
circuits whose size is exponential in n. In particular, some monotone functions requiring 
exponential-size monotone circuits can be realized by polynomial-size circuits over the standard 
basis $\Omega_0$. Thus, the absence of negation can result in a large increase in circuit size. 

The bounded-depth circuit is a circuit over the standard basis $\Omega_0$ where the fan-in of AND 
and OR gates is allowed to be unbounded, but the circuit depth is bounded. The conjunctive 
and disjunctive normal forms and the product-of-sums and sum-of-products normal forms 
realize arbitrary Boolean functions by circuits of depth 2 over $\Omega_0$. (See Section 2.3.) In these 
normal forms negations are used only on the input variables. Note that any circuit over the 
standard basis can be converted to a circuit in which the NOT gates are applied only to the 
input variables. (See Problem 9.11.) 
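
As an illustration of a depth-2 realization with unbounded fan-in, the following sketch builds the sum-of-products (disjunctive normal) form of a function from its truth table and evaluates it as one OR over ANDs of literals. It is a minimal example under the assumption that the function is supplied as a Python callable on bits; the names `dnf_terms` and `eval_dnf` are hypothetical.

```python
from itertools import product

def dnf_terms(f, n):
    """Minterms of f: one AND term for each input on which f has value 1."""
    return [x for x in product((0, 1), repeat=n) if f(*x)]

def eval_dnf(terms, x):
    """Depth-2 evaluation: an OR of ANDs of literals.

    In term t the literal for variable i is x_i when t_i = 1 and NOT x_i
    when t_i = 0, so the literal has value 1 exactly when x_i == t_i.
    Negations therefore appear only on input variables.
    """
    return int(any(all(xi == ti for xi, ti in zip(x, t)) for t in terms))

parity3 = lambda a, b, c: (a + b + c) % 2
terms = dnf_terms(parity3, 3)
print(len(terms))
print(all(eval_dnf(terms, x) == parity3(*x) for x in product((0, 1), repeat=3)))
```

The number of AND gates equals the number of inputs on which the function is 1, so this form can be exponentially large even though its depth is 2.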



9.1.2 Complexity Measures 

We now define the measures of complexity studied in this chapter. The depth of a circuit is 
the number of gates of fan-in 2 or more on the longest path in the circuit. (Note that NOT 
gates do not affect the depth measure.) 

DEFINITION 9.1.1 The circuit size of a binary function $f : \mathcal{B}^n \mapsto \mathcal{B}^m$ with respect to the basis 
$\Omega$, denoted $C_\Omega(f)$, is the smallest number of gates in any circuit for f over the basis $\Omega$. The circuit 
size with fan-out s, denoted $C_{s,\Omega}(f)$, is the circuit size of f when the circuit fan-out is limited 
to at most s. 

The circuit depth of a binary function $f : \mathcal{B}^n \mapsto \mathcal{B}^m$ with respect to the basis $\Omega$, denoted $D_\Omega(f)$, is 
the depth of the smallest depth circuit for f over the basis $\Omega$. The circuit depth with fan-out s, 
denoted $D_{s,\Omega}(f)$, is the circuit depth of f when the circuit fan-out is limited to at most s. 

The formula size of a Boolean function $f : \mathcal{B}^n \mapsto \mathcal{B}$ with respect to a basis $\Omega$, denoted $L_\Omega(f)$, is the 
minimal number of input vertices in any circuit of fan-out 1 for f over the basis $\Omega$. 

It is important to note the distinction between formula and circuit size: in the former 
the number of input vertices is counted, whereas in the latter it is the number of gates. A 
relationship between the two is shown in Lemma 9.2.2. 
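
To make the counting conventions concrete, the sketch below measures one particular fan-out-1 circuit, the formula of Figure 9.1, written as nested tuples. The representation and helper names are assumptions for illustration. The values it reports describe this specific circuit and are therefore only upper bounds on the corresponding complexity measures of the function.

```python
def V(i):
    """Leaf for input variable x_i."""
    return ("var", i)

# Formula of Figure 9.1 as nested tuples (operator, child, child).
FIG_9_1 = ("and",
           ("or", ("and", ("or", V(7), V(6)), ("or", V(5), V(4))), V(3)),
           ("and", V(2), V(1)))

def gates(t):
    """Number of gates in the tree circuit (what circuit size counts)."""
    if t[0] == "var":
        return 0
    return 1 + sum(gates(c) for c in t[1:])

def leaves(t):
    """Number of input vertices in the tree circuit (what formula size counts)."""
    if t[0] == "var":
        return 1
    return sum(leaves(c) for c in t[1:])

def depth(t):
    """Gates of fan-in 2 or more on the longest path; NOT gates are not counted."""
    if t[0] == "var":
        return 0
    step = 0 if t[0] == "not" else 1
    return step + max(depth(c) for c in t[1:])

print(gates(FIG_9_1), leaves(FIG_9_1), depth(FIG_9_1))
```

For this tree the gate count is one less than the leaf count, as expected for a binary tree without NOT gates.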

9.2 Relationships Among Complexity Measures 

In this section we explore the effect on circuit complexity measures of a change in either the 
basis or the fan-out of a circuit. We also establish relationships between circuit depth and 
formula size. 

9.2.1 Effect of Fan-Out on Circuit Size 

It is interesting to ask how the circuit size and depth of a function change as the maximal fan- 
out of a circuit is reduced. This issue is important in understanding these complexity measures 
and in the use of technologies that limit the fan-out of gates. The following simple facts about 
trees are useful in comparing complexity measures. (See Problem 9.2.) 

LEMMA 9.2.1 A rooted tree of maximal fan-in r containing k vertices has at most $k(r - 1) + 1$ 
leaves and a rooted tree with $l$ leaves and fan-in r has at most $l - 1$ vertices with fan-in 2 or more 
and at most $2(l - 1)$ edges. 

From the above result we establish the following connection between circuit size with fan- 
out 1 and formula size. 

LEMMA 9.2.2 Let $\Omega$ be a basis of fan-in r. For each $f : \mathcal{B}^n \mapsto \mathcal{B}$ the following inequalities hold 
between formula size, $L_\Omega(f)$, and fan-out-1 circuit size, $C_{1,\Omega}(f)$: 

$(L_\Omega(f) - 1)/(r - 1) \leq C_{1,\Omega}(f) \leq 3L_\Omega(f) - 2$ 

Proof The first inequality follows from the definition of formula size and the first result 
stated in Lemma 9.2.1 in which $k = C_{1,\Omega}(f)$. The second inequality also follows from 
Lemma 9.2.1. A tree with $L_\Omega(f)$ leaves has at most $L_\Omega(f) - 1$ vertices with fan-in of 2 or 
more and at most $2(L_\Omega(f) - 1)$ edges between vertices (including the leaves). Each of these 
edges can carry a NOT gate, as can the output gate, for a total of at most $2L_\Omega(f) - 1$ NOT 
gates. Thus, a circuit of fan-out 1 has at most $3L_\Omega(f) - 2$ gates. ■ 

As we now show, circuit size increases by at most a constant factor when the fan-out of the 
circuit is reduced to s for $s \geq 2$. Before developing this result we need a simple fact about a 
complete basis $\Omega$, namely, that at most two gates are needed to compute the identity function 
$i(x) = x$, as shown in the next paragraph. If a basis contains AND or OR gates, the identity 
function can be obtained by attaching both of their inputs to the same source. 

We are done if $\Omega$ contains a function such that by fixing all but one variable, $i(x)$ is 
computed. If not, then we look for a non-monotone function in $\Omega$. Since some binary 
functions are non-monotone ($\overline{x}$, for example), some function g in a complete basis $\Omega$ is 
non-monotone. This means there exist tuples x and y for g, $x \leq y$, such that $g(x) = 1 > g(y) = 0$. 
Let u and v be the largest and smallest tuples, respectively, satisfying $x \leq u \leq v \leq y$ 
and $g(u) = 1$ and $g(v) = 0$. Then u and v differ in at most one position. Without loss 
of generality, let that position be the first and let the values in the remaining positions in 
both tuples be $(c_2, \ldots, c_n)$. It follows that $g(1, c_2, \ldots, c_n) = 0$ and $g(0, c_2, \ldots, c_n) = 1$, or 
$g(x, c_2, \ldots, c_n) = \overline{x}$. Two such gates in series compute $\overline{\overline{x}} = x$. If $l(\Omega)$ is the number 
of gates from $\Omega$ needed to realize the identity function, then $l(\Omega) = 1$ or 2. 
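
The search for a negation inside a non-monotone gate can be sketched as follows. This is an illustrative brute-force version, not the argument's construction via largest and smallest tuples: it looks directly for two inputs that differ in one position and on which the gate's value drops from 1 to 0. The function names are assumptions for this example.

```python
from itertools import product

def find_negation(g, p):
    """Return (i, c) such that fixing every input of g except input i to the
    values in c makes g compute NOT of input i.

    Returns None when no such pair exists, which happens exactly when g is
    monotone increasing.
    """
    for c in product((0, 1), repeat=p):
        for i in range(p):
            if c[i] == 0:
                raised = c[:i] + (1,) + c[i + 1:]
                if g(*c) == 1 and g(*raised) == 0:
                    return i, c
    return None

def restrict(g, i, c):
    """One-variable function obtained from g by fixing all inputs but input i."""
    return lambda x: g(*(c[:i] + (x,) + c[i + 1:]))

nand = lambda x, y: 1 - (x & y)
i, c = find_negation(nand, 2)
neg = restrict(nand, i, c)          # one gate computing NOT
identity = lambda x: neg(neg(x))    # two gates computing i(x) = x

print([neg(b) for b in (0, 1)], [identity(b) for b in (0, 1)])
```

For a basis containing only NAND, this realizes the identity with two gates, consistent with $l(\Omega) \leq 2$.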

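THEOREM 9.2.1 Let $\Omega$ be a complete basis of fan-in r and let $f : \mathcal{B}^n \mapsto \mathcal{B}^m$. The following 
inequalities hold on $C_{s,\Omega}(f)$: 

$C_\Omega(f) \leq C_{s+1,\Omega}(f) \leq C_{s,\Omega}(f) \leq C_{1,\Omega}(f)$ 

Furthermore, $C_{s,\Omega}(f)$ has the following relationship to $C_\Omega(f)$ for $s \geq 2$: 

$C_{s,\Omega}(f) \leq C_\Omega(f) \left(1 + \frac{l(\Omega)(r - 1)}{s - 1}\right)$ 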

Proof The first set of inequalities holds because a smallest circuit with fan-out s is no smaller 
than a smallest circuit with fan-out s + 1, a less restrictive type of circuit. 

The last inequality follows by constructing a tree of identity functions at each gate whose 
fan-out exceeds s. (See Fig. 9.2.) If a gate has fan-out $\phi > s$, reduce the fan-out to s and 
then attach an identity gate to one of these s outputs. This increases the fan-out from s to 
$s + s - 1$. If $\phi$ is larger than this number, repeat the process of adding an identity gate k 
times, where k is the smallest integer such that $s + k(s - 1) \geq \phi$ or is the largest integer 
such that $s + (k - 1)(s - 1) < \phi$. Thus, $k < (\phi - 1)/(s - 1)$. 

Let $\phi_i$ denote the fan-out of the ith gate in a circuit for f of potentially unbounded 
fan-out and let $k_i$ be the largest integer satisfying the following bound: 

$k_i \leq \frac{\phi_i - 1}{s - 1}$ 

Then at most $\sum_i (k_i l(\Omega) + 1)$ gates are needed in the circuit of fan-out s to realize f, 
one for the ith gate in the original circuit and $k_i l(\Omega)$ gates for the $k_i$ copies of the identity 
function at the ith gate. Note that $\sum_i \phi_i$ is the number of edges directed away from gates 
in the original circuit. But since each edge directed away from a gate is an edge directed into 
a gate, this number is at most $rC_\Omega(f)$ since each gate has fan-in at most r. 

Figure 9.2 Conversion of a vertex with fan-out more than s to a subtree with fan-out s, 
illustrated for s = 2. 

It follows that the smallest number of gates in a circuit with fan-out s for f satisfies the 
following bound: 

$C_{s,\Omega}(f) \leq C_\Omega(f) + l(\Omega) \sum_i \frac{\phi_i - 1}{s - 1} \leq C_\Omega(f) \left(1 + \frac{l(\Omega)(r - 1)}{s - 1}\right)$ 

which demonstrates that circuit size with a fan-out $s \geq 2$ differs from the unbounded 
fan-out circuit size by at most a constant factor. ■ 
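
The counting in this proof can be traced on a small example. The sketch below is illustrative: it computes the number k of identity copies needed at a gate of fan-out $\phi$, the resulting gate count for a list of fan-outs, and the right-hand side of the theorem's bound. The fan-out list `[3, 2, 1, 1, 0]` is a made-up five-gate circuit of fan-in 2 (gate 1 feeds gates 2, 3, and 4; gate 2 feeds gates 3 and 4; gates 3 and 4 feed the output gate 5), and the function names are assumptions.

```python
from fractions import Fraction

def identity_gates(phi, s):
    """Smallest k >= 0 with s + k*(s - 1) >= phi (copies of the identity needed)."""
    k = 0
    while s + k * (s - 1) < phi:
        k += 1
    return k

def converted_size(fanouts, s, l_omega):
    """Gates after conversion: one per original gate plus l(Omega) per identity copy."""
    return sum(1 + l_omega * identity_gates(phi, s) for phi in fanouts)

def size_bound(size, r, s, l_omega):
    """Right-hand side of the bound: C * (1 + l(Omega)(r - 1)/(s - 1))."""
    return size * (1 + Fraction(l_omega * (r - 1), s - 1))

fanouts = [3, 2, 1, 1, 0]   # hypothetical circuit with 5 gates, fan-in r = 2
for l_omega in (1, 2):
    print(converted_size(fanouts, 2, l_omega), size_bound(len(fanouts), 2, 2, l_omega))
```

In both cases the converted size stays below the bound, and the bound is a constant multiple of the original size once r, s, and $l(\Omega)$ are fixed.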

With the construction employed in Theorem 9.2.1, an upper bound can be stated on 
$D_{s,\Omega}(f)$ that is proportional to the product of $D_\Omega(f)$ and $\log C_\Omega(f)$. (See Problem 9.12.) 
The upper bound stated above on $C_{s,\Omega}(f)$ can be achieved by a circuit that also achieves an 
upper bound on $D_{s,\Omega}(f)$ that is proportional to $D_\Omega(f)$ and $\log_r s$ [138]. 

9.2.2 Effect of Basis Change on Circuit Size and Depth 

We now consider the effect of a change in basis on circuit size and depth. In the next section 
we examine the relationship between formula size and depth, from which we deduce the effect 
of a basis change on formula size. 

LEMMA 9.2.3 Given two complete bases, $\Omega_a$ and $\Omega_b$, and a function $f : \mathcal{B}^n \mapsto \mathcal{B}^m$, the circuit 
size and depth of f in these two bases differ by at most constant multiplicative factors. 

Proof Because each basis is complete, every function in $\Omega_a$ can be computed by a fixed 
number of gates in $\Omega_b$, and vice versa. Given a circuit with basis $\Omega_a$, a circuit with basis 
$\Omega_b$ can be constructed by replacing each gate from $\Omega_a$ by a fixed number of gates from 
$\Omega_b$. This has the effect of increasing the circuit size by at most a constant factor. It follows 
that $C_{\Omega_a}(f) = \Theta(C_{\Omega_b}(f))$. Since this construction also increases the depth by at most a 
constant factor, it follows that $D_{\Omega_a}(f) = \Theta(D_{\Omega_b}(f))$. ■ 
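
As a concrete instance of the gate-by-gate replacement, the sketch below realizes each gate of the standard basis with a fixed number of NAND gates and derives a size bound from gate counts. The particular NAND realizations and the cost table are standard constructions chosen for this example, not taken from the text.

```python
from itertools import product

def nand(x, y):
    return 1 - (x & y)

def not_(x):
    """NOT from 1 NAND gate."""
    return nand(x, x)

def and_(x, y):
    """AND from 2 NAND gates."""
    return not_(nand(x, y))

def or_(x, y):
    """OR from 3 NAND gates."""
    return nand(not_(x), not_(y))

# NAND gates used per standard-basis gate in the realizations above.
NAND_COST = {"not": 1, "and": 2, "or": 3}

def nand_size_bound(gate_counts):
    """Upper bound on NAND-circuit size given gate counts of a standard-basis circuit."""
    return sum(NAND_COST[g] * k for g, k in gate_counts.items())

pairs = list(product((0, 1), repeat=2))
print(all(and_(x, y) == (x & y) and or_(x, y) == (x | y) for x, y in pairs))
print(nand_size_bound({"and": 3, "or": 3}))
```

The second line applies the replacement to a circuit with three AND and three OR gates, such as the one in Figure 9.1; the result is at most three times the original size, which is the constant factor for this pair of bases in this direction.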

9.2.3 Formula Size Versus Circuit Depth 

A logarithmic relationship exists between the formula size and circuit depth of a function, as 
we now show. If a formula is represented by a balanced tree, this result follows from the fact 
that the circuit fan-in is bounded. However, since we cannot guarantee that each formula 
corresponds to a balanced tree, we must find a way to balance an unbalanced tree. 

To balance a formula and provide a bound on the circuit depth of a function in terms of 
formula size, we make use of the multiplexer function $f_{\mathrm{mux}}^{(1)} : \mathcal{B}^3 \mapsto \mathcal{B}$ on three inputs 
$f_{\mathrm{mux}}^{(1)}(a, y_1, y_0)$. Here the value of a determines which of the two other values is returned. 

$f_{\mathrm{mux}}^{(1)}(a, y_1, y_0) = \begin{cases} y_0 & a = 0 \\ y_1 & a = 1 \end{cases}$ 

This function can be realized by 

$f_{\mathrm{mux}}^{(1)}(a, y_1, y_0) = (\overline{a} \wedge y_0) \vee (a \wedge y_1)$ 

The measure $d(\Omega)$ of a basis $\Omega$ defined below is used to obtain bounds on the circuit depth of 
a function in terms of its formula size. 

DEFINITION 9.2.1 Given a basis $\Omega$ of fan-in r, the constant $d(\Omega)$ is defined as follows: 

$d(\Omega) = \left(D_\Omega(f_{\mathrm{mux}}^{(1)}) + 1\right) / \log_r \left(\frac{r + 1}{r}\right)$ 

Over the standard basis $\Omega_0$, $d(\Omega_0) = 3.419$. 

We now derive a separator theorem for trees. This is a theorem stating that a tree can 
be decomposed into two trees of about the same size by removing one edge. We begin by 
establishing a property about trees that implies the separator theorem. 

LEMMA 9.2.4 Let T be a tree with n internal (non-leaf) vertices. If the fan-in of every vertex of 
T is at most r, then for any k, $1 \leq k \leq n$, T has a vertex v such that the subtree $T_v$ rooted at v 
has at least k leaves but each of its children $T_{v_1}, T_{v_2}, \ldots, T_{v_p}$, $p \leq r$, has fewer than k leaves. 

Proof If the property holds at the root, the result follows. If not, move to some subtree of 
T that has at least k leaves and apply the test recursively. Because a leaf vertex has one leaf 
vertex in its subtree, this process terminates on some vertex v at which the property holds. 
If it terminates on a leaf vertex, each of its children is an empty tree. ■ 

COROLLARY 9.2.1 Let T be a tree of fan-in r with n leaves. Then T has a subtree $T_v$ rooted at 
a vertex v such that $T_v$ has at least $\lceil n/(r + 1) \rceil$ leaves but at most $\lfloor rn/(r + 1) \rfloor$. 

Proof Let v be the vertex of Lemma 9.2.4 and let $k = \lceil n/(r + 1) \rceil$. Since $T_v$ has at most 
r subtrees each containing no more than $\lceil n/(r + 1) \rceil - 1 < n/(r + 1)$ leaves, the result 
follows. ■ 
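
The search described in the proof of Lemma 9.2.4 can be sketched directly. In this illustrative version a tree is a nested list of child subtrees and a leaf is the empty list; the function walks down from the root into any child that still has at least $k = \lceil n/(r + 1) \rceil$ leaves and stops when no child qualifies. The representation, the helper names, and the unbalanced "caterpillar" test tree are assumptions for this example.

```python
def count_leaves(t):
    """A tree is a list of child subtrees; a leaf is the empty list."""
    return 1 if not t else sum(count_leaves(c) for c in t)

def separator(t, r):
    """Return a subtree with at least ceil(n/(r+1)) and at most floor(r*n/(r+1))
    leaves, assuming every vertex has at most r children and t has n >= 2 leaves."""
    n = count_leaves(t)
    k = -(-n // (r + 1))                  # ceil(n / (r + 1))
    v = t
    while True:
        big = [c for c in v if count_leaves(c) >= k]
        if not big:
            return v                      # v has >= k leaves, each child has < k
        v = big[0]

def caterpillar(n):
    """Maximally unbalanced binary tree with n leaves."""
    t = []
    for _ in range(n - 1):
        t = [[], t]
    return t

for n in (4, 7, 10):
    m = count_leaves(separator(caterpillar(n), 2))
    print(n, m, -(-n // 3), (2 * n) // 3)
```

Each printed row shows n, the number of leaves in the subtree found, and the lower and upper limits from Corollary 9.2.1 for r = 2.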

We now apply this decomposition of trees to develop bounds on formula size. 

THEOREM 9.2.2 Let $\Omega$ be a complete basis of fan-in r. Any function $f : \mathcal{B}^n \mapsto \mathcal{B}$ with formula 
size $L_\Omega(f) \geq 2$ has circuit depth $D_\Omega(f)$ satisfying the following bounds: 

$\log_r L_\Omega(f) \leq D_\Omega(f) \leq d(\Omega) \log_r L_\Omega(f)$ 

Proof The lower bound follows because a rooted tree of fan-in r with depth d has at most 
$r^d$ leaves. Since $L_\Omega(f)$ leaves are needed to compute f with a tree circuit over $\Omega$, the result 
follows directly. 

The derivation of the upper bound is by induction on formula size. We first establish 
the basis for induction: that $D_\Omega(f) \leq d(\Omega) \log_r L_\Omega(f)$ for $L_\Omega(f) = 2$. To show this, 
observe that any function f with $L_\Omega(f) = 2$ depends on at most two variables. There are 16 
functions on two variables (which includes the functions on one variable), of which 10 have 
the property that both variables affect the output. Each of these 10 functions can be realized 
from a circuit for $f_{\mathrm{mux}}^{(1)}$ by adding at most one NOT gate on one input and one NOT on 
the output. (See Problem 9.13.) But, as seen from the discussion preceding Theorem 9.2.1, 
every complete basis contains a non-monotone function all but one of whose inputs can be 
fixed so that the function computes the NOT of its one remaining input. Thus, a circuit 
with depth $D_\Omega(f_{\mathrm{mux}}^{(1)}) + 2$ suffices to realize a function with $L_\Omega(f) = 2$. 

The basis for induction is that $D_\Omega(f_{\mathrm{mux}}^{(1)}) + 2 \leq d(\Omega) \log_r L_\Omega(f)$ for $L_\Omega(f) = 2$, 
which we now show. 

$d(\Omega) \log_r L_\Omega(f) = \left(D_\Omega(f_{\mathrm{mux}}^{(1)}) + 1\right) (\log_r 2) / \log_r \left(\frac{r + 1}{r}\right)$ 
$= \left(D_\Omega(f_{\mathrm{mux}}^{(1)}) + 1\right) / \log_2 \left(\frac{r + 1}{r}\right)$ 
$\geq 1.7 \left(D_\Omega(f_{\mathrm{mux}}^{(1)}) + 1\right) \geq D_\Omega(f_{\mathrm{mux}}^{(1)}) + 2$ 

since $(r + 1)/r \leq 1.5$ and $D_\Omega(f_{\mathrm{mux}}^{(1)}) \geq 1$. 

The inductive hypothesis is that any function f with a formula size $L_\Omega(f) \leq L_0 - 1$ 
can be realized by a circuit with depth $d(\Omega) \log_r L_\Omega(f)$. 

Let T be the tree associated with a formula for f of size $L_0$. The value computed by 
T can be computed from the function $f_{\mathrm{mux}}^{(1)}$ using the values produced by three trees, as 
suggested in Fig. 9.3. The tree $T_v$ of Corollary 9.2.1 and two copies of T from which $T_v$ 
has been removed and replaced by 0 in one case (the tree $T_0$) and 1 in the other (the tree 
$T_1$) are formed and the value of $T_v$ is used to determine which of $T_0$ and $T_1$ is the value T. 
Since $T_v$ has at least $\lceil L_0/(r + 1) \rceil$ and at most $\lfloor rL_0/(r + 1) \rfloor \leq L_0 - 1$ leaves, each of $T_0$ 
and $T_1$ has at most $L_0 - \lceil L_0/(r + 1) \rceil = \lfloor rL_0/(r + 1) \rfloor$ leaves. (See Problem 9.1.) Thus, 
all trees have at most $\lfloor rL_0/(r + 1) \rfloor \leq L_0 - 1$ leaves and the inductive hypothesis applies. 
Since the depth of the new circuit is the depth of $f_{\mathrm{mux}}^{(1)}$ plus the maximum of the depths of 
the three trees, f has the following depth bound: 

$D_\Omega(f) \leq D_\Omega(f_{\mathrm{mux}}^{(1)}) + d(\Omega) \log_r \left(\frac{r L_\Omega(f)}{r + 1}\right)$ 

The desired result follows from the definition of $d(\Omega)$, since 
$d(\Omega) \log_r ((r + 1)/r) = D_\Omega(f_{\mathrm{mux}}^{(1)}) + 1$. ■ 
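
The decomposition used in the inductive step can be checked on a small formula. The sketch below is illustrative: it takes the seven-leaf formula $((((x_7 \vee x_6) \wedge (x_5 \vee x_4)) \vee x_3) \wedge (x_2 \wedge x_1))$, removes the four-leaf subtree $(x_7 \vee x_6) \wedge (x_5 \vee x_4)$ as $T_v$ (four leaves lies between $\lceil 7/3 \rceil = 3$ and $\lfloor 14/3 \rfloor = 4$), forms $T_0$ and $T_1$ by substituting constants, and verifies on all inputs that the multiplexer of the three values reproduces T. The tuple representation and helper names are assumptions for this example.

```python
from itertools import product

def V(i):
    return ("var", i)

def ev(t, x):
    """Evaluate a formula tree on the assignment x (x[i] is the value of x_i)."""
    op = t[0]
    if op == "var":
        return x[t[1]]
    if op == "const":
        return t[1]
    if op == "not":
        return 1 - ev(t[1], x)
    a, b = ev(t[1], x), ev(t[2], x)
    return a & b if op == "and" else a | b

def mux(a, y1, y0):
    """(NOT a AND y0) OR (a AND y1): returns y0 when a = 0 and y1 when a = 1."""
    return ((1 - a) & y0) | (a & y1)

T_v = ("and", ("or", V(7), V(6)), ("or", V(5), V(4)))

def rest_of_tree(hole):
    """The original tree with the position of T_v filled by `hole`."""
    return ("and", ("or", hole, V(3)), ("and", V(2), V(1)))

T = rest_of_tree(T_v)
T0 = rest_of_tree(("const", 0))
T1 = rest_of_tree(("const", 1))

def decomposition_holds():
    for bits in product((0, 1), repeat=7):
        x = (0,) + bits                     # index 0 is unused padding
        if ev(T, x) != mux(ev(T_v, x), ev(T1, x), ev(T0, x)):
            return False
    return True

print(decomposition_holds())
```

The three trees $T_v$, $T_0$, and $T_1$ can be evaluated in parallel, which is the source of the depth saving: the depth of the combined circuit is the multiplexer depth plus the largest of the three subtree depths.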





Figure 9.3 Decomposition of a tree circuit T for the purpose of reducing its depth. A large 
subtree $T_v$ is removed and its value used to select the value computed by two trees formed from 
the original tree by replacing the value of $T_v$ alternately by 0 and 1. 

Combining this result with Lemma 9.2.3, we obtain a relationship between the formula 
sizes of a function over two different complete bases. 

THEOREM 9.2.3 Let $\Omega_a$ and $\Omega_b$ be two complete bases with fan-in $r_a$ and $r_b$, respectively. There 
is a constant $\alpha$ such that the formula size of a function $f : \mathcal{B}^n \mapsto \mathcal{B}$ with respect to these bases 
satisfies the following relationship: 

$L_{\Omega_a}(f) \leq [L_{\Omega_b}(f)]^{\alpha}$ 

Proof Let $D_{\Omega_a}(f)$ and $D_{\Omega_b}(f)$ be the depth of f over the bases $\Omega_a$ and $\Omega_b$, respectively. 
From Theorem 9.2.2, $\log_{r_a} L_{\Omega_a}(f) \leq D_{\Omega_a}(f)$ and $D_{\Omega_b}(f) \leq d(\Omega_b) \log_{r_b} L_{\Omega_b}(f)$. 

From Lemma 9.2.3 we know there is a constant $d_{a,b}$ such that if a function $f : \mathcal{B}^n \mapsto \mathcal{B}$ 
has depth $D_{\Omega_b}(f)$ over the basis $\Omega_b$, then it has depth $D_{\Omega_a}(f)$ over the basis $\Omega_a$, where 

$D_{\Omega_a}(f) \leq d_{a,b} D_{\Omega_b}(f)$ 

The constant $d_{a,b}$ is the depth of the largest-depth basis element of $\Omega_b$ when realized by a 
circuit over $\Omega_a$. 

Combining these facts, we have that 

$L_{\Omega_a}(f) \leq (r_a)^{D_{\Omega_a}(f)} \leq (r_a)^{d_{a,b} D_{\Omega_b}(f)}$ 
$\leq (r_a)^{d_{a,b} d(\Omega_b) \log_{r_b} L_{\Omega_b}(f)}$ 
$\leq L_{\Omega_b}(f)^{d_{a,b} d(\Omega_b) (\log_{r_b} r_a)}$ 

Here we have used the identity $x^{\log_y z} = z^{\log_y x}$. ■ 

This result can be extended to the monotone basis. (See Problem 9.14.) We now derive a 
relationship between circuit size and depth. 



9.3 Lower-Bound Methods for General Circuits 

In Chapter 2 upper bounds were derived for a variety of functions, including logical, arith- 
metic, shifting, and symmetric functions as well as encoder, decoder, multiplexer, and demul- 
tiplexer functions. We also established lower bounds on size and depth of the most complex 
Boolean functions on n variables. In this section we present techniques for deriving lower 
bounds on circuit size and depth for particular functions when realized by general logic circuits. 

9.3.1 Simple Lower Bounds 

A function $f : \mathcal{B}^n \mapsto \mathcal{B}$ on n variables is dependent on its ith variable, $x_i$, if there exist 
values $c_1, c_2, \ldots, c_{i-1}, c_{i+1}, \ldots, c_n$ such that 

$f(c_1, c_2, \ldots, c_{i-1}, 0, c_{i+1}, \ldots, c_n) \neq f(c_1, c_2, \ldots, c_{i-1}, 1, c_{i+1}, \ldots, c_n)$ 

This simple property leads to lower bounds on circuit size and depth that result from the 
connectivity that a circuit must have to compute a function depending on each of its variables. 

THEOREM 9.3.1 Let $f : \mathcal{B}^n \mapsto \mathcal{B}$ be dependent on each of its n variables. Then over each basis 
$\Omega$ of fan-in r, the size and depth of f satisfy the following lower bounds: 

$C_\Omega(f) \geq \left\lceil \frac{n - 1}{r - 1} \right\rceil$ 

$D_\Omega(f) \geq \lceil \log_r n \rceil$ 

Proof Consider a circuit of size $C_\Omega(f)$ for f. Since it has fan-in r, it has at most $rC_\Omega(f)$ 
edges between gates. After we show that this circuit also has at least $C_\Omega(f) + n - 1$ edges, 
we observe that $rC_\Omega(f) \geq C_\Omega(f) + n - 1$, from which the conclusion follows. 

Since f depends on each of its n variables, there must be at least one edge attached to 
each of them. Similarly, because the circuit has minimal size there must be at least one edge 
attached to each of the $C_\Omega(f)$ gates except possibly for the output gate. Thus, the circuit 
has at least $C_\Omega(f) + n - 1$ edges and the conclusion follows. 

The depth lower bound uses the fact that a circuit with depth d and fan-in r with the 
largest number of inputs is a tree. Such trees have at most $r^d$ leaves (input vertices). Because 
f depends on each of its variables, a circuit for f of depth d has at least n and at most $r^d$ 
leaves, from which the depth lower bound follows. ■ 

This lower bound is the best possible given the information used to derive it. To see this, 
observe that the function $f(x_1, x_2, \ldots, x_n) = x_1 \wedge x_2 \wedge \cdots \wedge x_n$, which depends on each of 
its variables, has circuit size $\lceil (n - 1)/(r - 1) \rceil$ and depth $\lceil \log_r n \rceil$ over the basis containing 
the r-input AND gate. (See Problem 9.15.) 
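
One way to see that both values can be met at once is to build the AND tree greedily. The sketch below is an illustration under stated assumptions, not the construction intended by Problem 9.15: it assumes that gates of fan-in between 2 and r are available, lets the first gate take fewer than r inputs when needed so that every later gate is full, and always combines the shallowest available signals. The gate count equals $\lceil (n - 1)/(r - 1) \rceil$ by construction; the depth is compared with $\lceil \log_r n \rceil$ only on the cases printed.

```python
import heapq

def and_tree(n, r):
    """Return (gates, depth) for the AND of n >= 2 inputs built greedily from
    AND gates of fan-in at most r (r >= 2)."""
    depths = [0] * n                      # every input sits at depth 0
    heapq.heapify(depths)
    fan_in = (n - 2) % (r - 1) + 2        # first gate may be partial; the rest are full
    gates = 0
    while len(depths) > 1:
        inputs = [heapq.heappop(depths) for _ in range(fan_in)]
        heapq.heappush(depths, max(inputs) + 1)
        gates += 1
        fan_in = r
    return gates, depths[0]

def ceil_log(n, r):
    """Smallest d with r**d >= n, computed with integers only."""
    d = 0
    while r ** d < n:
        d += 1
    return d

for n, r in [(5, 3), (10, 3), (17, 4), (8, 2)]:
    print((n, r), and_tree(n, r), (-(-(n - 1) // (r - 1)), ceil_log(n, r)))
```

Each row prints the pair (n, r), the size and depth of the tree built, and the two lower bounds of Theorem 9.3.1 for comparison.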

9.3.2 The Gate-Elimination Method for Circuit Size 

The search for methods to derive large lower bounds on circuit size for functions over complete 
bases has to date been largely unsuccessful. The largest lower bounds on circuit size that have 
been derived for explicitly defined functions are linear in n, the number of variables on which 
the functions depend. Since most Boolean functions on n variables have exponential size (see 
Theorem 2.12.1), functions do exist that have high complexity. Unfortunately, this fact doesn't 
help us to show that any particular problem has high circuit size. In particular, it does not help 
us to show that $P \neq NP$. 

In this section we introduce the gate-elimination method for deriving linear lower bounds. 
When applied with care, it provides the strongest known lower bounds for complete bases. 
The gate-elimination method uses induction on the properties of a function f on n variables 
to show two things: a) a few variables of f can be assigned values so that the resulting function 
is of the same type as f, and b) a few gates in any circuit for f can be eliminated by this 
assignment of values. After eliminating all variables by assigning values to them, the function 
is constant. Since the number of gates in the original circuit cannot be smaller than the number 
removed during this process, the original circuit has at least as many gates as were removed. 

We now apply the gate-elimination method to functions in the class $Q_{2,3}^{(n)}$ defined below. 
Functions in this class have at least three different subfunctions when any pair of variables 
ranges through all four possible assignments. 

DEFINITION 9.3.1 A Boolean function $f : \mathcal{B}^n \mapsto \mathcal{B}$ belongs to the class $Q_{2,3}^{(n)}$ if for any two 
variables $x_i$ and $x_j$, f has at least three distinct subfunctions as $x_i$ and $x_j$ range over all possible 
values. Furthermore, for each variable $x_i$ there is a value $c_i$ such that the subfunction of f obtained 
by assigning $x_i$ the value $c_i$ is in $Q_{2,3}^{(n-1)}$. 

The class Q\J contains the function / mod 3c : B n <— > B, as we show. Here z mod a is 
the remainder of z after removing all multiples of a. 

LEMMA 9.3. 1 For n > 3 and c £ {0, 1,2}, the function f ™ od 3 : B™ i-> £ defined below is 

in Q23 ■' 

/ mod 3,o( x i' #2, •••>£„) = ((l/ + c) mod 3) mod 2 

where y = y^._. a;, and ^2 tnd-\- denote integer addition. 

Proof We show that the functions $f^{(n)}_{\bmod 3,c}$, $c \in \{0, 1, 2\}$, are all distinct when $n \geq 1$.
When n = 1, the functions are different because $f^{(1)}_{\bmod 3,0}(x_1) = x_1$, $f^{(1)}_{\bmod 3,1}(x_1) =
\bar{x}_1$, and $f^{(1)}_{\bmod 3,2}(x_1) = 0$. For n = 2, y can assume values in {0, 1, 2}. Because the
functions $f^{(2)}_{\bmod 3,0}(x_1, x_2)$, $f^{(2)}_{\bmod 3,1}(x_1, x_2)$, and $f^{(2)}_{\bmod 3,2}(x_1, x_2)$ have value 1 only when
$y = x_1 + x_2 = 1, 0, 2$, respectively, the three functions are different.

The proof of membership of $f^{(n)}_{\bmod 3,c}$ in $Q_{2,3}^{(n)}$ is by induction. The base case is n = 3,
which holds, as shown in the next paragraph. The inductive hypothesis is that for each
$c \in \{0, 1, 2\}$, $f^{(n-1)}_{\bmod 3,c} \in Q_{2,3}^{(n-1)}$.

To show that for $n \geq 3$, $f^{(n)}_{\bmod 3,c}$ has at least three distinct subfunctions as any two of its
variables range over all values, let $y^*$ be the sum of the n - 2 variables that are not fixed and
let $c^*$ be the sum of c and the values of the two variables that are fixed. Then the value of the
function is $((y^* + c^*) \bmod 3) \bmod 2 = (((y^* \bmod 3) + (c^* \bmod 3)) \bmod 3) \bmod 2$.
Since $(y^* \bmod 3)$ and $(c^* \bmod 3)$ range over the values 0, 1, and 2, the three functions are
different, as shown in the first paragraph of this proof.

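To show that for any variable $x_i$ there is an assignment $c_i$ such that the subfunction of
$f^{(n)}_{\bmod 3,c}$ obtained by setting $x_i = c_i$ is in $Q_{2,3}^{(n-1)}$, let $c_i = 0$. The subfunction is then
$f^{(n-1)}_{\bmod 3,c}$, which is in $Q_{2,3}^{(n-1)}$ by the inductive hypothesis. ■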
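The pairwise-subfunction condition of Definition 9.3.1 can be checked by brute force for small n. The following Python sketch is illustrative only: the helper names are hypothetical, variables are indexed from 0, and only the first condition of the definition (at least three distinct subfunctions for every pair of variables) is tested, in time exponential in n.

```python
from itertools import combinations, product

def f_mod3(c):
    """Return the function x -> ((sum(x) + c) mod 3) mod 2 of Lemma 9.3.1."""
    return lambda x: ((sum(x) + c) % 3) % 2

def count_subfunctions(f, n, i, j):
    """Number of distinct subfunctions of f as x_i and x_j take all four values."""
    free = [v for v in range(n) if v not in (i, j)]
    tables = set()
    for bi, bj in product((0, 1), repeat=2):
        table = []
        for rest in product((0, 1), repeat=len(free)):
            x = [0] * n
            x[i], x[j] = bi, bj
            for v, b in zip(free, rest):
                x[v] = b
            table.append(f(x))
        # A subfunction is identified with its truth table on the free variables.
        tables.add(tuple(table))
    return len(tables)

def has_three_subfunctions(f, n):
    """First condition of Definition 9.3.1, checked for every pair of variables."""
    return all(count_subfunctions(f, n, i, j) >= 3
               for i, j in combinations(range(n), 2))
```

For contrast, the parity function has only two subfunctions for any pair of variables, so it fails the test and the 2n - 3 bound derived from this class does not apply to it.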

We now derive a lower bound on the circuit size of functions in the class $Q_{2,3}^{(n)}$.

THEOREM 9.3.2 Over the basis of all Boolean functions on two inputs, $\Omega$, if $f \in Q_{2,3}^{(n)}$ for
$n \geq 3$, then

$$C_\Omega(f) \geq 2n - 3$$

Proof We show that f depends on each of its variables. Suppose it does not depend on
$x_i$. Then, pick $x_i$ and a second variable $x_j$ and let them range over all four possible values.
Since the value of $x_i$ has no effect on f, f has at most two subfunctions as $x_i$ and $x_j$ range
over all values, contradicting its definition.

We now show that some input vertex $x_i$ of a circuit for f has fan-out of 2 or more.
Consider a gate g in a circuit for f whose longest path to the output gate is at least as long
as that of any other gate. (See Fig. 9.4.) Since the circuit does not have loops and no other
vertex is farther away from the output, both of g's input edges must be attached to input
vertices. Let $x_i$ and $x_j$ be the two inputs to this gate. If the fan-out of both of these input
vertices is 1, they influence the value of f only through the one gate to which they are
connected. Since this gate has at most two values for the four assignments to inputs, f has at
most two subfunctions, contradicting the definition of f.

Figure 9.4 A circuit in which gate $g_4$ has maximal distance from the output gate gs. The input
$x_j$ has fan-out 2.

If n = 3, this fact demonstrates that the fan-out from the three inputs has to be at
least 4, that is, the circuit has at least four inputs. From Theorem 9.3.1 it follows that
$C_\Omega(f) \geq 2n - 3$ for n = 3. This is the base case for a proof by induction.

The inductive hypothesis is that for any $f^* \in Q_{2,3}^{(n-1)}$, $C_\Omega(f^*) \geq 2(n-1) - 3$. From
the earlier argument it follows that there is an input vertex $x_i$ in a circuit for $f \in Q_{2,3}^{(n)}$ that
has fan-out 2. Let $x_i$ have that value that causes the subfunction $f^*$ of f to be in $Q_{2,3}^{(n-1)}$.
Fixing $x_i$ eliminates at least two gates in the circuit for f because each gate connected to $x_i$
either has a constant output, computes the identity, or computes the NOT of its input. The
negation, if any, can be absorbed by the gate that precedes or follows it. Thus,

$$C_\Omega(f) \geq C_\Omega(f^*) + 2 \geq 2(n-1) - 3 + 2 = 2n - 3$$

which establishes the result. ■

As a consequence of this theorem, the function $f^{(n)}_{\bmod 3,c}$ requires at least 2n - 3 gates over
the basis $B_2$. It can also be shown to require at most 3n + O(1) gates [86].

We now derive a second lower-bound result using the gate-elimination method. In this
case we demonstrate that the upper bound on the complexity of the multiplexer function
$f^{(n)}_{\mathrm{mux}} : B^{2^n+n} \mapsto B$ introduced in Section 2.5.5, which is $2^{n+1} + O(n\sqrt{2^n})$, is optimal to
within an additive term of size $O(n\sqrt{2^n})$. (The multiplexer function is also called the storage
access function.) We generalize the storage access function $f^{(n,k)}_{\mathrm{SA}} : B^{n+k} \mapsto B$ slightly and
write it in terms of a k-bit address a and an n-tuple x, as shown below, where |a| denotes the
integer represented by the binary number a and $2^k \geq n$:

$$f^{(n,k)}_{\mathrm{SA}}(a_{k-1}, \ldots, a_1, a_0, x_{n-1}, \ldots, x_0) = x_{|a|}$$

Thus, $f^{(m)}_{\mathrm{mux}} = f^{(2^m,m)}_{\mathrm{SA}}$.

To derive a lower bound on the circuit size of $f^{(n,k)}_{\mathrm{SA}}$ we introduce the class $F_s^{(n,k)}$ of
Boolean functions on n + k variables defined below.

DEFINITION 9.3.2 A Boolean function $f : B^{n+k} \mapsto B$ belongs to the class $F_s^{(n,k)}$, $2^k \geq n$, if
for some set $S \subseteq \{0, 1, \ldots, n-1\}$, $|S| = s$,

$$f(a_{k-1}, \ldots, a_1, a_0, x_{n-1}, \ldots, x_0) = x_{|a|}$$

for $|a| \in S$.

Clearly, $f^{(n,k)}_{\mathrm{SA}}$ is a member of $F_n^{(n,k)}$. We now show that every function in $F_s^{(n,k)}$ has circuit
size that is at least 2s - 2.

In the proof of Theorem 9.3.2 the gate-elimination method replaced variables with constants.
In the following proof this idea is extended to replacing variables by functions. Applying
this result with $s = 2^n$, we have that $C_\Omega(f^{(n)}_{\mathrm{mux}}) \geq 2^{n+1} - 2$.

THEOREM 9.3.3 Let $f : B^{n+k} \mapsto B$ belong to $F_s^{(n,k)}$, $2^k \geq n$. Then over the basis $B_2$ the
circuit size of f satisfies the following bound:

$$C_\Omega(f) \geq 2s - 2$$

Proof In the proof of Theorem 9.3.2 we used the fact that some input variable has fan-out
2 or more, as deduced from a property of functions in $Q_{2,3}^{(n)}$. This fact does not hold for the
storage access function (multiplexer), as can be seen from the construction in Section 2.5.5.
Thus, our lower-bound argument must explicitly take into account the fact that the fan-out
from some input can be 1.

The following proof uses the fact that the basis $B_2$ contains functions of two kinds, AND-type
and parity-type functions. The former compute expressions of the form $(x^a \wedge y^b)^c$ for
Boolean constants a, b, c, where the notation $x^c$ denotes $x$ when c = 1 and $\bar{x}$ when c = 0.
Parity-type functions compute expressions of the form $x \oplus y \oplus c$ for some Boolean constant
c. (See Problem 9.19.)

The proof is by induction on the value of s. In the base case s = 1 and the lower bound
is trivially 0. The inductive hypothesis assumes that for $s = s' - 1$, $C_\Omega(f) \geq 2(s' - 1) - 2$.
We let $s = s'$ and consider the following mutually exclusive cases:

a) For some $i \in S$, $x_i$ has fan-out 2. Replacing $x_i$ by a constant allows elimination of
at least two gates, replaces S by $S - \{i\}$, which has size $s' - 1$, and reduces f to
$f^* \in F_{s'-1}^{(n,k)}$, from which we conclude that

$$C_\Omega(f) \geq 2 + C_\Omega(f^*) \geq 2 + (2s' - 4) = 2s' - 2$$

b) For some $i \in S$, $x_i$ has fan-out 1, its unique successor is a gate G of AND-type, and G
computes the expression $(x_i^a \wedge g^b)^c$ for some function g of the inputs. Setting $x_i = \bar{a}$
sets $x_i^a = \bar{a}^a = 0$, thereby causing the expression to have value $0^c = \bar{c}$, which is a constant.
Since G cannot be the output gate, this substitution allows the elimination of G and at
least one successor gate, reduces f to $f^* \in F_{s'-1}^{(n,k)}$, and replaces S by $S - \{i\}$, from
which the lower bound follows.

c) For some $i \in S$, $x_i$ has fan-out 1, its unique successor is a gate G of parity-type, and
G computes the expression $x_i \oplus g \oplus c$ for some function g of the inputs. Replace S by
$S - \{i\}$. Since we ask that the output of the circuit be $x_{|a|}$ for $|a| \in S - \{i\}$, this output
cannot depend on the value of G because a change in $x_i$ would cause the value of G to
change. Thus, G is not the output gate and when $|a| \in S - \{i\}$ we can set its value to
any function without affecting the value computed by the circuit. In particular, setting
$x_i = g$ causes G to have value c, a constant. This substitution allows the elimination of
G and at least one successor gate, and reduces f to $f^* \in F_{s'-1}^{(n,k)}$, from which the lower
bound follows.

Thus, in all cases, $C_\Omega(f) \geq 2s' - 2$. ■

The lower bounds given above are derived for two functions over the basis B2. The best 
circuit-size lower bound that has been derived for this basis is 3(n — 1). When the basis 
is restricted, larger lower bounds may result, as mentioned in the notes and illustrated by 
Problems 9.22 and 9.23. 

9.4 Lower-Bound Methods for Formula Size 

Since formulas correspond to circuits of fan-out 1, the formula size of a function may be much 
larger than its circuit size. In this section we introduce two techniques for deriving lower 
bounds on formula size that illustrate this point. Each leads to bounds that are quadratic or 
nearly quadratic in the number of inputs. The first, due to Neciporuk [230], applies to any 
complete basis. The second, due to Krapchenko [174], applies to the standard basis $\Omega_0$.

To fix ideas about formula size, we construct a circuit of fan-out 1 for the indirect storage
access function $f^{(k,l)}_{\mathrm{ISA}} : B^{k+lK+L} \mapsto B$, where $K = 2^k$ and $L = 2^l$:

$$f^{(k,l)}_{\mathrm{ISA}}(a, x_{K-1}, \ldots, x_0, y) = y_{|x_{|a|}|}$$

Here a is a k-tuple, $x_j = (x_{j,l-1}, \ldots, x_{j,0})$ is an l-tuple for $0 \leq j \leq K - 1$, and
$y = (y_{L-1}, \ldots, y_0)$ is an L-tuple. The value of $f^{(k,l)}_{\mathrm{ISA}}$ is computed by indirection; that is,
the value of a is treated as a binary number with value |a| that is used to select the |a|th
l-tuple $x_{|a|}$; this, in turn, is treated as a binary number and its value is used to select the
$|x_{|a|}|$th variable in y.

A circuit realizing $f^{(k,l)}_{\mathrm{ISA}}$ from multiple copies of the multiplexer (direct storage access
function) $f^{(n)}_{\mathrm{mux}} : B^{2^n+n} \mapsto B$ is shown schematically in Fig. 9.5. This circuit uses l copies
of $f^{(k)}_{\mathrm{mux}} : B^{2^k+k} \mapsto B$ and one copy of $f^{(l)}_{\mathrm{mux}} : B^{2^l+l} \mapsto B$. The copies of $f^{(k)}_{\mathrm{mux}}$ produce
the |a|th l-tuple, which is supplied to the copy of $f^{(l)}_{\mathrm{mux}}$ to select a variable from y. Since, as
shown in Lemma 2.5.5, the function $f^{(k)}_{\mathrm{mux}}$ can be realized by a circuit of size linear in $2^k$, a
circuit for $f^{(k,l)}_{\mathrm{ISA}}$ can be constructed that is also linear in the size of its input.

A formula for $f^{(k,l)}_{\mathrm{ISA}}$ has fan-out of 1 from every gate. The circuit sketched in Fig. 9.5 has
fan-out 1 if and only if the fan-out within each multiplexer circuit is also 1. To construct a
formula from this circuit, we first construct one for $f^{(l)}_{\mathrm{mux}}$. The total number of times that
address bits appear in a formula for $f^{(l)}_{\mathrm{mux}}$ determines the number of copies of the formula for
$f^{(k)}_{\mathrm{mux}}$ that are used in the formula for $f^{(k,l)}_{\mathrm{ISA}}$. A proof by induction can be developed to show
that a formula for $f^{(p)}_{\mathrm{mux}}$ can be constructed of size $3 \cdot 2^p - 2$ in which address bits occur
$2(2^p - 1)$ times. (See Problem 9.24.) Since each occurrence of an address bit in $f^{(l)}_{\mathrm{mux}}$
corresponds to a copy of the formula for $f^{(k)}_{\mathrm{mux}}$, by choosing $L = 2^l = n$ and k the smallest
integer such that $K = 2^k \geq n/l$ we see that $f^{(k,l)}_{\mathrm{ISA}}$ has $2^l + l2^k + k = O(n)$ variables and that
its formula size is $2(2^l - 1)L_\Omega(f^{(k)}_{\mathrm{mux}}) + L_\Omega(f^{(l)}_{\mathrm{mux}})$, which is $O(n^2/\log_2 n)$, as summarized
in Lemma 9.4.1.

Figure 9.5 The schema used to construct a circuit of fan-out 1 for the indirect storage access
function $f^{(k,l)}_{\mathrm{ISA}}$.

LEMMA 9.4.1 Let $2^l = n$ and $k = \lceil \log_2(n/l) \rceil$. Then the formula size of $f^{(k,l)}_{\mathrm{ISA}} : B^{k+lK+L} \mapsto B$
satisfies the following bound:

$$L_\Omega(f^{(k,l)}_{\mathrm{ISA}}) = O(n^2/\log_2 n)$$
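The size and address-bit counts quoted above for a multiplexer formula can be checked with a short Python sketch. It assumes the usual recursive formula that splits on the most significant address bit (the construction intended by Problem 9.24 may differ), measures formula size by the number of leaves, and uses hypothetical helper names and a nested-tuple encoding chosen only for illustration.

```python
def mux_formula(p, lo=0):
    """Formula (nested tuples) that selects y[lo + |a|] using address bits a[p-1..0]."""
    if p == 0:
        return ("y", lo)
    half = 1 << (p - 1)
    top = ("a", p - 1)                      # most significant remaining address bit
    return ("or",
            ("and", ("not", top), mux_formula(p - 1, lo)),
            ("and", top, mux_formula(p - 1, lo + half)))

def leaves(f):
    """Formula size, counted as the number of leaves."""
    if f[0] in ("a", "y"):
        return 1
    return sum(leaves(g) for g in f[1:])

def address_leaves(f):
    """Number of leaves that are labeled with address bits."""
    if f[0] == "a":
        return 1
    if f[0] == "y":
        return 0
    return sum(address_leaves(g) for g in f[1:])

def evaluate(f, a, y):
    """Evaluate the formula; a[i] is address bit i (least significant first)."""
    op = f[0]
    if op == "a":
        return a[f[1]]
    if op == "y":
        return y[f[1]]
    if op == "not":
        return 1 - evaluate(f[1], a, y)
    if op == "and":
        return evaluate(f[1], a, y) & evaluate(f[2], a, y)
    return evaluate(f[1], a, y) | evaluate(f[2], a, y)
```

Under these assumptions the leaf count satisfies L(p) = 2L(p-1) + 2 with L(0) = 1, and the address-bit count satisfies A(p) = 2A(p-1) + 2 with A(0) = 0, which gives the two closed forms stated in the text.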



We now introduce Neciporuk's method, by which it can be shown that this bound for
$f^{(k,l)}_{\mathrm{ISA}}$ is optimal to within a constant multiplicative factor.



9.4.1 The Neciporuk Lower Bound

The Neciporuk lower-bound method uses a partition of the variables $X = \{x_1, x_2, \ldots, x_n\}$ of
a Boolean function $f^{(n)} : B^n \mapsto B$ into disjoint sets $X_1, X_2, \ldots, X_p$. That is, $X = \bigcup_{i=1}^{p} X_i$
and $X_i \cap X_j = \emptyset$ for $i \neq j$. The lower bound on the formula size of f is stated in terms of
$r_{X_j}(f)$, $1 \leq j \leq p$, the number of subfunctions of f when restricted to variables in $X_j$.
That is, $r_{X_j}(f)$ is the number of different subfunctions of f in the variables in $X_j$ obtained
by ranging over all values for variables in $X - X_j$.

We now describe Neciporuk's lower bound on formula size. We emphasize that the strength
of the lower bound depends on which partition $X_1, X_2, \ldots, X_p$ of the variables X is chosen.
After the proof we apply it to the indirect storage access function. The method cannot provide a
lower bound that is larger than $O(n^2/\log n)$ for a function on n variables. (See Problem 9.25.)

THEOREM 9.4.1 For every complete basis $\Omega$ there is a constant $c_\Omega$ such that for every function
$f^{(n)} : B^n \mapsto B$ and every partition of its variables X into disjoint sets $X_1, X_2, \ldots, X_p$, the
formula size of f with respect to $\Omega$ satisfies the following lower bound:

$$L_\Omega(f) \geq c_\Omega \sum_{j=1}^{p} \log_2 r_{X_j}(f)$$

Proof Consider T, a minimal circuit of fan-out 1 for f. Let $n_j$ be the number of instances
of variables in $X_j$ that are labels for leaves in T. Then by definition $L_\Omega(f) = \sum_{j=1}^{p} n_j$.
Let d be the fan-in of the basis $\Omega$.

For each j, 1 < j < p, we define the subtree Tj of T consisting of paths from vertices 
with labels in Xj to the output vertex, as suggested by the heavy lines in Fig. 9.6. We 
observe that some vertices in such a subtree have one input from a vertex in the subtree Tj 
(called controllers — shaded vertices in Fig. 9.6) whereas others have more than one input 
from a vertex in Tj (combiners — black vertices in Fig. 9.6). Each type of vertex typically 
has inputs from vertices other than those in Tj, that is, from vertices on paths from input 
vertices in X — Xj . 

When the variables X — Xj are assigned values, the output of a controller or com- 
biner vertex depends only on the inputs it receives from other vertices in Tj . The function 
computed by a controller is a function of its one input y in Tj and can be represented as 
(a A y) © b for some values of the constants a and b. These constants are determined by 
the values of inputs in X — Xj . We assume without loss of generality that each chain of 
controllers with no intervening combiners is compressed to one controller. The combiner is 
also some function of its inputs from other vertices in Tj. Since the number of such inputs 
is at least 2, a combiner (with fan-in at most d) has at most d - 2 inputs determined by
variables in X — Xj . 







Figure 9.6 The subtree Tj of the tree T is identified by heavy edges on paths from input vertices 
in the set Xj = {xi, x^\. Vertices in Tj that have one heavy input edge are controller vertices. 
Other vertices in $T_j$ are combiner vertices.

By Lemma 9.2.1, since $T_j$ has $n_j$ leaves, the number of vertices with fan-in of 2 or more
(combiners) is at most $n_j - 1$. Also, by Lemma 9.2.1, $T_j$ has at most $2(n_j - 1)$ edges. Since
$T_j$ may have one controller at the output and at most one per edge, $T_j$ has at most $2n_j - 1$
controllers.

The number of functions computed by a combiner is at most $2^{d-2}$ since at most
d - 2 of its inputs are determined by variables in $X - X_j$. At most four functions are
computed by a controller since there are at most four functions on one variable. It follows
that the tree $T_j$ associated with the input variables in $X_j$ containing $n_j$ leaves computes
$r_{X_j}(f)$ different functions where $r_{X_j}(f)$ satisfies the following upper bound. This bound is the
product of the number of ways that each of the controllers and combiners can compute
functions.

$$r_{X_j}(f) \leq 2^{(d-2)(n_j-1)} \, 4^{(2n_j-1)} \leq 2^{(d+2)n_j}$$

Thus, $(d + 2)n_j \geq \log_2 r_{X_j}(f)$. Since $L_\Omega(f) = \sum_{j=1}^{p} n_j$, the theorem holds for $c_\Omega =
1/(d + 2)$. ■

Applying Neciporuk's lower bound to the indirect storage access function yields the fol- 
lowing result, which demonstrates that the upper bound given in Lemma 9.4.1 for the indirect 
storage access function is tight. 

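LEMMA 9.4.2 Let $2^l = n$ and $k = \lceil \log_2(n/l) \rceil$. The formula size of $f^{(k,l)}_{\mathrm{ISA}} : B^{k+lK+L} \mapsto B$
satisfies the following bound:

$$L_\Omega(f^{(k,l)}_{\mathrm{ISA}}) = \Omega\left(\frac{n^2}{\log_2 n}\right)$$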



Proof Let $p = K = 2^k$ and let $X_j$ contain $x_j$. If $X_j$ contains other variables, these are
assigned fixed values, which cannot increase $r_{X_j}(f)$. For $0 \leq j \leq K - 1$, set |a| = j.
f has at least $2^L$ restrictions since for each of the $2^L$ assignments to $(y_{L-1}, \ldots, y_0)$ the
restriction of f is distinct; that is, if two different such L-tuples are supplied as input, they
can be distinguished by some assignment to $x_j$. Thus $r_{X_j}(f) \geq 2^L$. Hence, the formula
size of $f^{(k,l)}_{\mathrm{ISA}}$ satisfies $L_\Omega(f^{(k,l)}_{\mathrm{ISA}}) \geq c_\Omega K L$, which is proportional to $n^2/\log n$. ■
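For very small parameters the quantities $r_{X_j}(f)$ used in this proof can be computed by exhaustive enumeration. The Python sketch below is a brute-force illustration under stated assumptions: tuples are listed least-significant bit first, the block $X_j$ holds exactly the l bits of $x_j$, and the function names are hypothetical. Its running time is exponential in the number of variables.

```python
from itertools import product
from math import log2

def f_isa(a, xs, y):
    """Indirect storage access: a selects the tuple xs[|a|], whose value selects a bit of y."""
    def value(bits):
        return sum(b << i for i, b in enumerate(bits))
    return y[value(xs[value(a)])]

def r_block(k, l, j):
    """Number of distinct subfunctions of f_ISA in the l variables of x_j."""
    K, L = 1 << k, 1 << l
    others = k + (K - 1) * l + L            # number of variables outside X_j
    tables = set()
    for rest in product((0, 1), repeat=others):
        a = rest[:k]
        other_x = [rest[k + i * l: k + (i + 1) * l] for i in range(K - 1)]
        y = rest[k + (K - 1) * l:]
        table = []
        for xj in product((0, 1), repeat=l):
            xs = other_x[:j] + [xj] + other_x[j:]   # put x_j back in position j
            table.append(f_isa(a, xs, y))
        tables.add(tuple(table))
    return len(tables)

def neciporuk_sum(k, l):
    """Sum over j of log2 r_{X_j}(f), the quantity bounded in Theorem 9.4.1."""
    return sum(log2(r_block(k, l, j)) for j in range(1 << k))
```

In these small cases every block attains $r_{X_j}(f) = 2^L$, so the sum equals KL, in agreement with the count used in the proof.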

9.4.2 The Krapchenko Lower Bound 

Krapchenko's lower bound applies to the standard basis $\Omega_0$ or any complete subset, namely
$\{\wedge, \neg\}$ and $\{\vee, \neg\}$. It provides a lower bound on formula size that can be slightly larger than
that given by Neciporuk's method. 

We apply Krapchenko's method to the parity function $f^{(n)}_\oplus : B^n \mapsto B$, where
$f^{(n)}_\oplus(x_1, x_2, \ldots, x_n) = x_1 \oplus x_2 \oplus \cdots \oplus x_n$, to show that its formula size is quadratic in n.
Since the parity function on two variables can be expressed by the formula

$$f^{(2)}_\oplus(x_1, x_2) = (x_1 \wedge \bar{x}_2) \vee (\bar{x}_1 \wedge x_2)$$

it is straightforward to show that the formula size of $f^{(n)}_\oplus$ is at most quadratic in n. (See
Problem 9.26.)

DEFINITION 9.4.1 Given two disjoint subsets $A, B \subseteq \{0, 1\}^n$ of the set of the Boolean n-tuples,
the neighborhood of A and B, $\mathcal{N}(A, B)$, is the set of pairs of tuples (x, y), $x \in A$ and
$y \in B$, such that x and y agree in all but one position.

The neighborhood of A = {0} and B = {1} is the pair $\mathcal{N}(A, B) = \{(0, 1)\}$. Also,
the neighborhood of A = {000, 101} and B = {111, 010} is the set of pairs $\mathcal{N}(A, B) =
\{(000, 010), (101, 111)\}$.
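The definition of a neighborhood translates directly into code. The following Python sketch, with a hypothetical function name and tuples represented as bit strings, reproduces the two examples just given.

```python
def neighborhood(A, B):
    """Pairs (x, y), x in A and y in B, that agree in all but one position."""
    return {(x, y) for x in A for y in B
            if len(x) == len(y) and sum(p != q for p, q in zip(x, y)) == 1}

# The two examples from the text.
print(neighborhood({"0"}, {"1"}))
print(neighborhood({"000", "101"}, {"111", "010"}))
```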

Given a function $f : B^n \mapsto B$, we use the notation $f^{-1}(0)$ and $f^{-1}(1)$ to denote the sets
of n-tuples that cause f to assume the values 0 and 1, respectively.

THEOREM 9.4.2 For any $f : B^n \mapsto B$ and any $A \subseteq f^{-1}(0)$ and $B \subseteq f^{-1}(1)$, the following
inequality holds over the standard basis $\Omega_0$:

$$L_{\Omega_0}(f) \geq \frac{|\mathcal{N}(A, B)|^2}{|A||B|}$$

Proof Consider a circuit for f of fan-out 1 over the standard basis that has the minimal
number of leaves, namely $L_{\Omega_0}(f)$. Since the fan-in of each gate is either 1 or 2, by
Lemma 9.2.1 the number of leaves is one more than the number of gates of fan-in 2. Each
fan-in-2 gate is an AND or OR gate with suitable negation on its inputs and outputs.

Consider a minimal formula for f. Assume without loss of generality that the formula
is written over the basis $\{\wedge, \neg\}$. We prove the lower bound by induction, the base case
being that of a function on one variable. If the function is constant, $|\mathcal{N}(A, B)| = 0$ and
its formula size is also 0. If the function is non-constant, it is either $x$ or $\bar{x}$. (If f(x) = x,
$f^{-1}(1) = \{1\}$ and $f^{-1}(0) = \{0\}$.) In both cases, $|\mathcal{N}(A, B)| = 1$ since the neighborhood
has only one pair. (In the first case $\mathcal{N}(A, B) = \{(0, 1)\}$.) Also, |A| = 1 and |B| = 1,
thereby establishing the base case.

The inductive hypothesis is that $L_{\Omega_0}(f^*) \geq |\mathcal{N}(A, B)|^2/(|A||B|)$ for any function $f^*$
whose formula size $L_{\Omega_0}(f^*) \leq L_0 - 1$ for some $L_0 \geq 2$. Since the occurrences of NOT
do not affect the formula size of a function, apply DeMorgan's theorem as necessary so that
the output gate of the optimal (minimal-depth) formula for f is an AND gate. Then we can
write $f = g \wedge h$, where g and h are defined on the variables appearing in their formulas.
Since the formula for f is optimal, so are the formulas for g and h.

Let $A \subseteq f^{-1}(0)$ and $B \subseteq f^{-1}(1)$. Thus, f(x) = 0 for $x \in A$ and f(x) = 1 for
$x \in B$. Since $f = g \wedge h$, if f(x) = 1, then both g(x) = 1 and h(x) = 1. That is,
$f^{-1}(1) \subseteq g^{-1}(1)$ and $f^{-1}(1) \subseteq h^{-1}(1)$. (See Fig. 9.7.) It follows that $B \subseteq g^{-1}(1)$ and
$B \subseteq h^{-1}(1)$. Let $B_1 = B_2 = B$. Let $A_1 = A \cap g^{-1}(0)$ (which implies $A_1 \subseteq g^{-1}(0)$)
and let $A_2 = A - A_1$. Since f(x) = 0 for $x \in A$, but g(x) = 1 for $x \in A_2$, as suggested
in Fig. 9.7, it follows that $A_2 \subseteq h^{-1}(0)$. (Since $f = g \wedge h$, f(x) = 0, and g(x) = 1, it
follows that h(x) = 0.) Finally, observe that $\mathcal{N}(A_1, B_1)$ and $\mathcal{N}(A_2, B_2)$ are disjoint ($A_1$
and $A_2$ have no tuples in common) and that $|\mathcal{N}(A, B)| = |\mathcal{N}(A_1, B_1)| + |\mathcal{N}(A_2, B_2)|$.

Figure 9.7 The relationships among the sets $f^{-1}(1)$, $g^{-1}(1)$, $h^{-1}(1)$, $A_2$, and $h^{-1}(0)$.

Given the inductive hypothesis, it follows from the above that

$$L_{\Omega_0}(f) = L_{\Omega_0}(g) + L_{\Omega_0}(h) \geq \frac{|\mathcal{N}(A_1, B_1)|^2}{|A_1||B_1|} + \frac{|\mathcal{N}(A_2, B_2)|^2}{|A_2||B_2|} = \frac{1}{|B|}\left(\frac{|\mathcal{N}(A_1, B_1)|^2}{|A_1|} + \frac{|\mathcal{N}(A_2, B_2)|^2}{|A_2|}\right)$$

By the inequality $n_1^2/a_1 + n_2^2/a_2 \geq (n_1 + n_2)^2/(a_1 + a_2)$, which holds for positive integers
(see Problem 9.3), the desired result follows because $|A| = |A_1| + |A_2|$. ■
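The inequality invoked in the last step can be verified in one line. The derivation below is a sketch added for convenience (Problem 9.3 may intend a different argument) and assumes only that $a_1, a_2 > 0$. Placing the three terms over the common denominator $a_1 a_2 (a_1 + a_2)$ gives

$$\frac{n_1^2}{a_1} + \frac{n_2^2}{a_2} - \frac{(n_1 + n_2)^2}{a_1 + a_2} = \frac{(n_1 a_2 - n_2 a_1)^2}{a_1 a_2 (a_1 + a_2)} \geq 0$$

Taking $n_1$ and $n_2$ to be the sizes of the two neighborhoods and $a_1 = |A_1|$, $a_2 = |A_2|$ yields the bound with $|A| = |A_1| + |A_2|$ in the denominator.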

Krapchenko's method is easily applied to the parity function $f^{(n)}_\oplus$. We need only let A
(B) contain n-tuples having an even (odd) number of 1s. ($|A| = |B| = 2^{n-1}$.) Then
$|\mathcal{N}(A, B)| = n2^{n-1}$ because for any vector in A there are exactly n vectors in B that are
neighbors of it. It follows that $L_{\Omega_0}(f^{(n)}_\oplus) \geq n^2$.

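The quantity in Krapchenko's bound can be evaluated directly for the parity function on a few inputs. The following Python sketch uses hypothetical helper names and enumerates all n-tuples, so it is practical only for small n.

```python
from itertools import product

def neighborhood_size(A, B):
    """Number of pairs (x, y), x in A and y in B, differing in exactly one position."""
    B = set(B)
    count = 0
    for x in A:
        for i in range(len(x)):
            y = x[:i] + (1 - x[i],) + x[i + 1:]   # flip position i of x
            if y in B:
                count += 1
    return count

def krapchenko_bound(A, B):
    """|N(A,B)|^2 / (|A| |B|), the lower bound of Theorem 9.4.2."""
    return neighborhood_size(A, B) ** 2 / (len(A) * len(B))

def parity_sets(n):
    """A holds the n-tuples with an even number of 1s, B those with an odd number."""
    tuples = list(product((0, 1), repeat=n))
    A = [x for x in tuples if sum(x) % 2 == 0]
    B = [x for x in tuples if sum(x) % 2 == 1]
    return A, B
```

Flipping any one bit of an even-weight tuple gives an odd-weight tuple, which is why every tuple in A has exactly n neighbors in B and the bound evaluates to $n^2$.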


9.5 The Power of Negation 



As a prelude to the discussion of monotone circuits for monotone functions in the next section,
we consider the minimum number of negations necessary to realize an arbitrary Boolean
function $f : B^n \mapsto B^m$. From Problem 2.12 on dual-rail logic we know that every such
function can be realized by a monotone circuit in which both the variables $x_1, x_2, \ldots, x_n$ and
their negations $\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_n$ are provided as inputs. Furthermore, every such circuit need
have only at most twice as many AND and OR gates as a minimal circuit over $\Omega_0$, the standard
basis. Also, the depth of the dual-rail logic circuit of a function is at most one more than the
depth of a minimal-depth circuit, the extra depth being that to form $\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_n$.

Let $f^{(n)}_{\mathrm{NEG}} : B^n \mapsto B^n$ be defined by $f^{(n)}_{\mathrm{NEG}}(x_1, x_2, \ldots, x_n) = (\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_n)$. As
shown in Lemma 9.5.1, this function can be realized by a circuit of size $O(n^2 \log n)$ and
depth $O(\log n)$ over $\Omega_0$ using $\lceil \log_2(n + 1) \rceil$ negations. This implies that most Boolean
functions on n variables can be realized by a circuit whose size and depth are within a factor of
about 2 of their minimal values when the number of negations is $\lceil \log_2(n + 1) \rceil$.

THEOREM 9.5.1 Every Boolean function on n variables, $f : B^n \mapsto B^m$, can be realized by a
circuit containing at most $\lceil \log_2(n + 1) \rceil$ negations. Furthermore, the minimal size and depth of
such circuits is at most $2C_{\Omega_0}(f) + O(n^2 \log n)$ and $D_{\Omega_0}(f) + O(\log n)$, respectively, where
$C_{\Omega_0}(f)$ and $D_{\Omega_0}(f)$ are the circuit size and depth of f over the standard basis $\Omega_0$.

Proof The proof follows directly from the dual-rail expansion of Problem 2.12 and the
following lemma. ■

We now show that the function $f^{(n)}_{\mathrm{NEG}} : B^n \mapsto B^n$ defined by $f^{(n)}_{\mathrm{NEG}}(x_1, x_2, \ldots, x_n) =
(\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_n)$ can be realized by a circuit of size $O(n^2 \log n)$ over $\Omega_0$ using $\lceil \log_2(n + 1) \rceil$
negations.

LEMMA 9.5.1 $f^{(n)}_{\mathrm{NEG}} : B^n \mapsto B^n$ can be realized with $\lceil \log_2(n + 1) \rceil$ negations by a circuit over
the standard basis that has size $O(n^2 \log n)$ and depth $O(\log n)$.

Proof The punctured threshold function $\tau^{(n)}_{t,\neg i} : B^n \mapsto B$, $1 \leq t, i \leq n$, is defined below.

$$\tau^{(n)}_{t,\neg i}(x) = \begin{cases} 1 & \text{if } \sum_{j \neq i} x_j \geq t \\ 0 & \text{otherwise} \end{cases}$$

This function has value 1 if t or more of the variables other than $x_i$ have value 1. The
standard threshold function $\tau^{(n)}_t : B^n \mapsto B$ has value 1 when t or more of the variables
have value 1. Since the function $(\tau^{(n)}_{1,\neg i}, \tau^{(n)}_{2,\neg i}, \ldots, \tau^{(n)}_{n-1,\neg i})$ is the result of sorting all but
the ith input, we know from Theorem 6.8.3 that Batcher's bitonic sorting algorithm will
produce this output with a circuit of size $O(n \log^2 n)$ and depth $O(\log^2 n)$ because max
and min of a comparator unit compute OR and AND, respectively, on binary inputs. Ajtai,
Komlos, and Szemeredi [14] have improved this bound to $O(n \log n)$ but with a very large
coefficient, and simultaneously achieve depth $O(\log n)$. Thus, all the functions
$\{\tau^{(n)}_{t,\neg i} \mid 1 \leq t, i \leq n\}$ can be realized with $O(n^2 \log n)$ gates and depth $O(\log n)$ over $\Omega_0$.

Observe that for input x there is some largest t, $t = t_0$, such that $\tau^{(n)}_t(x) = 1$. If
$\tau^{(n)}_{t_0,\neg i}(x) = 1$, then $x_i = 0$; otherwise, $x_i = 1$. Let the implication function $a \Rightarrow b$
have value 1 when a = 0 or when a = 1 and b = 1 and value 0 otherwise. Then we
can express the implication function by the formula $(a \Rightarrow b) = \bar{a} \vee b$. It follows that
$\bar{x}_i = (\tau^{(n)}_{t_0}(x) \Rightarrow \tau^{(n)}_{t_0,\neg i}(x))$ because the implication function has value 1 exactly when
$x_i = 0$.

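We use an indirect method to compute $t_0$. Since $\tau^{(n)}_t(x) = 0$ for $t > t_0$, $(\tau^{(n)}_t(x) \Rightarrow
\tau^{(n)}_{t,\neg i}(x)) = 1$ for $t > t_0$. Also, both $\tau^{(n)}_t(x)$ and $\tau^{(n)}_{t,\neg i}(x)$ have value 1 for $t < t_0$. Using
$(x \Rightarrow y) = \bar{x} \vee y$, we can write $\bar{x}_i$ as follows:

$$\bar{x}_i = (\bar{\tau}^{(n)}_1(x) \vee \tau^{(n)}_{1,\neg i}(x)) \wedge (\bar{\tau}^{(n)}_2(x) \vee \tau^{(n)}_{2,\neg i}(x)) \wedge \cdots \wedge (\bar{\tau}^{(n)}_n(x) \vee \tau^{(n)}_{n,\neg i}(x))$$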
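The expression for the negation of $x_i$ in terms of threshold and punctured threshold functions can be confirmed exhaustively for small n. In the Python sketch below the function names are hypothetical, the thresholds are computed by counting rather than by a sorting circuit, and the index i runs from 0.

```python
from itertools import product

def tau(t, x):
    """Threshold function: 1 if at least t of the inputs are 1."""
    return int(sum(x) >= t)

def tau_punctured(t, i, x):
    """Punctured threshold: 1 if at least t of the inputs other than x[i] are 1."""
    return int(sum(x) - x[i] >= t)

def negation_via_thresholds(i, x):
    """AND over t = 1..n of (NOT tau_t(x)) OR tau_{t, not i}(x)."""
    n = len(x)
    return int(all((1 - tau(t, x)) | tau_punctured(t, i, x)
                   for t in range(1, n + 1)))

# Spot check on one input: x_1 = 1 (index 1), so the conjunction should be 0.
print(negation_via_thresholds(1, (0, 1, 1)))
```

Only the complemented thresholds require negations here, which is why the remaining design effort goes into computing them with few NOT gates.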



The circuit design is complete once a circuit for $\{\bar{\tau}^{(n)}_t(x) \mid 1 \leq t \leq n\}$ has been
designed. We begin by using a binary sorting circuit that computes $\{\tau^{(n)}_t(x) \mid 1 \leq t \leq n\}$
from x, which, as stated above, can be computed with $O(n \log n)$ gates over the standard
basis. Let $s_t = \tau^{(n)}_t(x)$ for $1 \leq t \leq n$.

For n = K - 1, $K = 2^k$ and k an integer, we complete the design by constructing
a circuit for the function $\nu^{(k)} : B^n \mapsto B^n$, which, given as input the decreasing sequence
$s_1, s_2, \ldots, s_n$ ($s_i \geq s_{i+1}$), computes as its jth output $z_j = \bar{s}_j$, $1 \leq j \leq n$. (The case
$n \neq 2^k - 1$ is considered below.) That is, $\nu^{(k)}(s) = z$, where $z_j = \bar{\tau}^{(n)}_j(x)$. We give
a recursive construction of a circuit for $\nu^{(k)}$ whose correctness is established by induction.

Figure 9.8 A circuit for $\nu^{(k)}$. It is given the sorted n-tuple s as input, where $s_j \geq s_{j+1}$
for $1 \leq j \leq n$, and produces as output z, where $z_j = \bar{s}_j$.

The base case is a circuit for $\nu^{(1)}$. This circuit has one input, $s_1$, and one output, $z_1 = \bar{s}_1$,
and can be realized by one negation and no other gates.

We construct a circuit for $\nu^{(k)}$ from one for $\nu^{(k-1)}$ using $2n - 1$ additional gates and
increasing the depth by three, as shown in Fig. 9.8. Let the inputs and outputs to the circuit
for $\nu^{(k-1)}$ be $s^*_i$ and $z^*_i$, $1 \leq i \leq K^* - 1$, where $K^* = K/2$. It follows that $s^*_i \geq s^*_{i+1}$
for $1 \leq i \leq (K/2) - 1$. By induction $z^*_i = \bar{s}^*_i$ for $1 \leq i \leq K^* - 1$.

To show that the jth output of the circuit for $\nu^{(k)}$ is $z_j = \bar{s}_j$, we consider cases. If
$s_{2^{k-1}} = 0$, then $s_j = 0$ for $j \geq K/2$. In this case the jth circuit output, $(K/2) \leq j \leq
K - 1$, satisfies $z_j = 1$ (the corresponding output gate is OR), which is the correct value.
Also, for $1 \leq j \leq (K/2) - 1$, $z_j = z^*_j = \bar{s}_j$ since the inputs to the circuit for $\nu^{(k-1)}$ are
$s_1, s_2, \ldots, s_{(K/2)-1}$ ($s_j = 0$ for $j \geq K/2$) and its outputs are $\bar{s}_1, \bar{s}_2, \ldots, \bar{s}_{(K/2)-1}$. On
the other hand, if $s_{K/2} = 1$, then $s_j = 1$ and $z_j = 0$ for $j \leq (K/2) - 1$ (the corresponding
output gate is AND). Also, for $(K/2) + 1 \leq j \leq K - 1$, $z_j = z^*_{j-(K/2)} = \bar{s}_j$ since the inputs to
the circuit for $\nu^{(k-1)}$ are $s_{(K/2)+1}, \ldots, s_{K-1}$ and its outputs are $\bar{s}_{(K/2)+1}, \ldots, \bar{s}_{K-1}$.

It follows that $k = \log_2(n + 1)$ negations are used. The circuit for $\nu^{(k)}$ uses a total of
$C(k) = C(k - 1) + 2^{k+1} - 3$ gates, where C(1) = 1. The solution to this recurrence
is $C(k) = 4(2^k) - 3k - 4 = 4n - 3\log_2(n + 1)$. Also, the circuit for $\nu^{(k)}$ has depth
$D(k) = D(k - 1) + 4$, where D(1) = 0. The solution to this recurrence is $D(k) = 4(k - 1)$.
If n is not of the form $2^k - 1$, we increase n to the next largest integer of this form, which
implies that $k = \lceil \log_2(n + 1) \rceil$. Using the upper bounds on the size of circuits to compute
$\tau^{(n)}_{t,\neg i}(x)$ for $1 \leq t, i \leq n$, we have the desired conclusion. ■
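The recursive construction can be simulated to check both its correctness and its gate count. The Python sketch below is one wiring consistent with the case analysis and the recurrence for C(k) given above; the exact wiring of Fig. 9.8 is not reproduced in this text, so the formula used for the inputs of the sub-circuit is an assumption, as are the function name and the 0-based list indexing.

```python
def nu(s):
    """Negate a non-increasing 0/1 list s of length 2**k - 1 using k NOT gates.

    Returns (z, gates, negations) with z[j] == 1 - s[j].
    """
    n = len(s)
    if n == 1:
        return [1 - s[0]], 1, 1               # base case: a single NOT gate
    half = (n + 1) // 2                       # K/2
    not_mid = 1 - s[half - 1]                 # NOT of s_{K/2}: the only new negation
    # Assumed input stage: pass the low half if s_{K/2} = 0, else the high half.
    sub_in = [(s[i] & not_mid) | s[half + i] for i in range(half - 1)]
    sub_out, gates, negs = nu(sub_in)
    low = [z & not_mid for z in sub_out]      # outputs 1 .. K/2 - 1 (AND gates)
    high = [z | not_mid for z in sub_out]     # outputs K/2 + 1 .. K - 1 (OR gates)
    gates += 1 + 2 * (half - 1) + 2 * (half - 1)
    return low + [not_mid] + high, gates, negs + 1
```

Each level adds one NOT gate, 2(K/2 - 1) gates on the inputs of the sub-circuit and 2(K/2 - 1) gates on its outputs, for a total of $2^{k+1} - 3$, which matches the recurrence for C(k).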



9.6 Lower-Bound Methods for Monotone Circuits 

The best lower bounds that have been derived on the circuit size over complete bases of Boolean 
functions on n variables are linear in n. Similarly, the best lower bounds on formula size that 
have been derived over complete bases are at best quadratic in n. As a consequence, the search 
for better lower bounds has led to the study of monotone circuits (their basis is $\Omega_{\mathrm{mon}}$) for
monotone functions. In one sense, this effort has been surprisingly successful. Techniques
have been developed to show that some monotone functions have exponential circuit size.
Since most monotone Boolean functions on n variables have circuit size $\Theta(2^n/n^{3/2})$, this is
a strong result. On the other hand, the hope that such techniques would lead to strong lower 
bounds on circuit size for monotone functions over complete bases has not yet been realized. 

Some monotone functions are very important. Among these is the clique function
$f^{(n)}_{\mathrm{clique},k} : B^{n(n-1)/2} \mapsto B$. $f^{(n)}_{\mathrm{clique},k}$ is associated with a family of undirected graphs
G = (V, E) on n = |V| vertices and $|E| \leq n(n-1)/2$ edges, where V = {1, 2, 3, ..., n}.
The variables of $f^{(n)}_{\mathrm{clique},k}$ are denoted $\{x_{i,j} \mid 1 \leq i < j \leq n\}$, where $x_{i,j} = 1$ if there is an
edge between vertices i and j and $x_{i,j} = 0$ otherwise. The value of $f^{(n)}_{\mathrm{clique},k}$ on these variables
is 1 if G contains a k-clique, a set of k vertices such that there is an edge between every pair of
vertices in the set. The value of $f^{(n)}_{\mathrm{clique},k}$ is 0 otherwise. Clearly $f^{(n)}_{\mathrm{clique},k}$ is monotone because
increasing the value of a variable from 0 to 1 cannot decrease the value of the function.
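The clique function is easy to evaluate by exhaustive search on small graphs, which also makes its monotonicity concrete. The following Python sketch uses a hypothetical function name and takes the graph as a list of edges rather than as the vector of variables $x_{i,j}$; it examines every k-subset of vertices and is therefore practical only for small n.

```python
from itertools import combinations

def f_clique(n, k, edges):
    """1 if the graph on vertices 1..n with the given edges contains a k-clique, else 0."""
    E = {frozenset(e) for e in edges}
    return int(any(all(frozenset(p) in E for p in combinations(S, 2))
                   for S in combinations(range(1, n + 1), k)))

# A triangle contains a 3-clique; a path on four vertices does not.
print(f_clique(4, 3, [(1, 2), (2, 3), (1, 3)]))
print(f_clique(4, 3, [(1, 2), (2, 3), (3, 4)]))
```

Adding an edge can only add cliques, so the value never drops from 1 to 0 when a variable is raised from 0 to 1.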

As stated in Problem 8.24, the CLIQUE problem is NP-complete. Since an instance of
CLIQUE on a graph with n vertices can be converted to the input format for $f^{(n)}_{\mathrm{clique},k}$ in time
polynomial in n, if the circuit size for $f^{(n)}_{\mathrm{clique},k}$ over a complete basis can be shown to be
superpolynomial, then from Corollary 3.9.1, $P \neq NP$.

There are important similarities and differences between monotone and non-monotone 
functions. Every non-monotone function can be realized by a circuit over the standard basis 
$\Omega_0$ in which negations are used only on inputs. (See Problem 9.11.) On the other hand, since
circuits without negation compute only monotone functions (Problem 2), negations on inputs 
are essential. 

The first results showing the existence of monotone functions such that their monotone
and non-monotone circuit sizes are different were obtained for multiple-output functions. We
illustrate this approach below for the n-input binary sorting function, $f^{(n)}_{\mathrm{sort}}$, whose monotone
circuit size is shown to be $\Theta(n \log n)$. As stated in Problem 2.17, this function can be realized
by a circuit whose size over $\Omega_0$ is linear in n.

We introduce the path method to show that a gap exists between the monotone and
non-monotone circuit size of a family of functions. In Section 9.6.3 the approximation method
is introduced and used to show that the clique function $f^{(n)}_{\mathrm{clique},k}$ has exponential monotone
circuit size.

9.6.1 The Path-Elimination Method

In this section we illustrate the path-elimination method for deriving lower bounds on circuit 
size for monotone functions. This method demonstrates that a path of gates in a monotone 
circuit can be eliminated by fixing one input variable. Thus, it is the monotone equivalent 
of the gate-elimination method for general circuits. We apply the method to two problems, 
binary sorting and binary merging. 

Consider computing the binary sorting function $f^{(n)}_{\mathrm{sort}} : B^n \mapsto B^n$ introduced in
Section 2.11. This function rearranges the bits in a binary n-input string into descending order.
Thus, the first sorted output is 1 if one or more of the inputs is 1, the second is 1 if two or more
of them are 1, etc. Consequently, we can write $f^{(n)}_{\mathrm{sort}}(x_1, x_2, \ldots, x_n) = (\tau^{(n)}_1, \tau^{(n)}_2, \ldots, \tau^{(n)}_n)$,
where $\tau^{(n)}_t$ is the threshold function on n inputs with threshold t whose value is 1 if t or more
of its inputs are 1 and 0 otherwise. Ajtai, Komlos, and Szemeredi [14] have shown the
existence of a comparator-based sorting network on n inputs of size $O(n \log n)$. (The coefficient
on this bound is so large that the bound has only asymptotic value.) Such networks can be
converted to a monotone network by replacing the max and min operators in comparators 
with OR and AND, respectively. 
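
The following sketch illustrates the two descriptions of binary sorting given above: as the tuple of threshold functions, and as a comparator network in which max is replaced by OR and min by AND. The function names are illustrative, and a simple quadratic-size bubble-sort network stands in for the $O(n \log n)$ network of Ajtai, Komlós, and Szemerédi, which is not reproduced here.

```python
from itertools import product


def threshold(t, bits):
    """tau_t: 1 if at least t of the inputs are 1, else 0."""
    return 1 if sum(bits) >= t else 0


def f_sort(bits):
    """Binary sorting written as the tuple (tau_1, ..., tau_n)."""
    n = len(bits)
    return tuple(threshold(t, bits) for t in range(1, n + 1))


def monotone_sorting_network(bits):
    """Bubble-sort comparator network with max -> OR and min -> AND.

    Assumption: this Theta(n^2) network is only a stand-in used to show
    the gate substitution; it is not the O(n log n) network cited above.
    """
    w = list(bits)
    n = len(w)
    for _ in range(n):
        for i in range(n - 1):
            hi = w[i] | w[i + 1]   # OR plays the role of max
            lo = w[i] & w[i + 1]   # AND plays the role of min
            w[i], w[i + 1] = hi, lo
    return tuple(w)


# Exhaustive comparison on all 5-bit inputs.
all_agree = all(
    monotone_sorting_network(x) == f_sort(x) == tuple(sorted(x, reverse=True))
    for x in product((0, 1), repeat=5)
)
```

Because only AND and OR gates are used, every wire of the network computes a monotone function of the inputs.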

THEOREM 9.6.1 The monotone circuit size for $f_{\text{sort}}^{(n)}$ satisfies the following bounds: 

$$n \lceil \log_2 n \rceil - 2^{\lceil \log_2 n \rceil} \le C_{\Omega_{\text{mon}}}(f_{\text{sort}}^{(n)}) = O(n \log n)$$

Proof To derive the lower bound, we show that in any circuit for $f_{\text{sort}}^{(n)}$ there is an input 
variable that can be set to 1, thereby allowing at least $\lceil \log_2 n \rceil$ gates along a path from it to 
the output $\tau_1$ to be removed from the circuit and converting the circuit to one for $f_{\text{sort}}^{(n-1)}$. 
As a result, we show the following relationship: 

$$C_{\Omega_{\text{mon}}}(f_{\text{sort}}^{(n)}) \ge C_{\Omega_{\text{mon}}}(f_{\text{sort}}^{(n-1)}) + \lceil \log_2 n \rceil$$

A simple proof by induction and a little algebra show that the desired result follows from 
this bound and the fact that $C_{\Omega_{\text{mon}}}(f_{\text{sort}}^{(2)}) = 2$, which is easy to establish. 

Let $x_j = 0$ for $j \ne i$ but let $x_i$ vary. The only functions computed at gates are 0, 1, or 
$x_i$. Also, the value of $\tau_1(x)$ on such inputs is equal to $x_i$. Consequently, there must be a 
path P from the vertex labeled $x_i$ to $\tau_1$ such that at each gate on the path the function $x_i$ is 
computed. (See Fig. 9.9.) Thus, if we set $x_i = 1$ when $x_j = 0$ for $j \ne i$ the output of each 
of these gates is 1. Furthermore, since the circuit is monotone, each function computed at a 
gate is monotone (see Problem 2). Thus, if any other input is subsequently increased from 
0 to 1, the value of $\tau_1$ and of all the gates on the path P from $x_i$ remain at 1 and can be 
removed. This setting of $x_i$ also has the effect of reducing the threshold of all other output 
functions by 1 and implies that the circuit now computes the binary sorting function on one 
fewer variable. 

Consider a minimal monotone circuit for $f_{\text{sort}}^{(n)}$. The shortest paths from each input to 
the output $\tau_1$ form a tree of fan-in 2. From Theorem 9.3.1 there is a path in this tree from 
some input, say $x_r$, to $\tau_1$ that has length at least $\lceil \log_2 n \rceil$. Consequently the shortest path 
from $x_r$ to $\tau_1$ has length at least $\lceil \log_2 n \rceil$, implying that at least $\lceil \log_2 n \rceil$ gates can be 
removed if $x_r$ is set to 1. ■ 

Figure 9.9 When $x_i = 1$ there is a path P to $\tau_1$ such that each gate on P has value 1. 
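
The "simple proof by induction and a little algebra" in the proof of Theorem 9.6.1 can be checked numerically. The sketch below unrolls the recurrence down to the two-input base case and compares the result with the closed form in the theorem; the helper names are illustrative only.

```python
def ceil_log2(n):
    """Exact ceil(log2 n) for integers n >= 1, avoiding floating point."""
    return (n - 1).bit_length()


def recurrence_bound(n):
    """Unroll C(n) >= C(n-1) + ceil(log2 n) down to the base case C(2) = 2."""
    bound = 2
    for i in range(3, n + 1):
        bound += ceil_log2(i)
    return bound


def closed_form(n):
    """The lower bound n*ceil(log2 n) - 2^ceil(log2 n) stated in the theorem."""
    return n * ceil_log2(n) - 2 ** ceil_log2(n)


# The unrolled recurrence exceeds the closed form by exactly 2 for n >= 2,
# so the closed form is a valid (slightly weaker) lower bound.
gap_is_two = all(recurrence_bound(n) == closed_form(n) + 2 for n in range(2, 500))
```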



We now derive a stronger result: we show that every monotone circuit for binary merging 
has a size that is $\Omega(n \log n)$. Binary merging is realized by a function $f_{\text{merge}}^{(n)} : \mathcal{B}^n \mapsto \mathcal{B}^n$, 
$n = 2k$, defined as follows: given two sorted binary k-tuples x and y, the value of $f_{\text{merge}}^{(n)}(x, y)$ 
is the n-tuple that results from sorting the n-tuple formed by concatenating x and y. Thus, 
a binary merging circuit can be obtained from one for sorting simply by restricting the values 
assumed by inputs to the sorting circuit. (Binary merging is a subfunction of binary sorting.) 

It follows that a lower bound on $C_{\Omega_{\text{mon}}}(f_{\text{merge}}^{(n)})$ is a lower bound on $C_{\Omega_{\text{mon}}}(f_{\text{sort}}^{(n)})$. 

THEOREM 9.6.2 Let n be even. Then the monotone circuit size for $f_{\text{merge}}^{(n)} : \mathcal{B}^n \mapsto \mathcal{B}^n$ satisfies 
the following bounds: 

$$(n/2) \log_2 n - O(n) \le C_{\Omega_{\text{mon}}}(f_{\text{merge}}^{(n)}) = O(n \log n)$$

Proof The upper bound on $C_{\Omega_{\text{mon}}}(f_{\text{merge}}^{(n)})$ follows from the construction given in 
Theorem 6.8.2 after max and min comparison operators are replaced by ORs and ANDs, 
respectively. 

Let $k = n/2$. The function $f_{\text{merge}}^{(n)}$ operates on two k-tuples x and y to produce the 
merged result $f_{\text{merge}}^{(n)}(x, y)$, where x and y are in descending order; that is, 
$x_1 \ge x_2 \ge \cdots \ge x_k$ and $y_1 \ge y_2 \ge \cdots \ge y_k$. As stated above for binary sorting, the 
output functions are $\tau_1, \tau_2, \ldots, \tau_n$. 

Let $x_1 = x_2 = \cdots = x_{r-1} = 1$, $x_{r+1} = \cdots = x_k = 0$, $y_1 = y_2 = \cdots = y_s = 1$, 
and $y_{s+1} = \cdots = y_k = 0$. Let $x_r$ be unspecified. Since the circuit is monotone, the value 
computed by each gate circuit is 0, 1, or $x_r$. Also, 

$$\tau_t(x, y) = \begin{cases} 1 & t < r + s \\ x_r & t = r + s \\ 0 & t > r + s \end{cases}$$

It follows that there must be a path $P_r$ of gates from the input labeled $x_r$ to the 
output labeled $\tau_{r+s}$ such that each gate output is $x_r$. If $x_r = 0$, since the components of x 
are sorted, $x_{r+1} = \cdots = x_k = 0$. On the other hand, if $x_r = 1$, by monotonicity the value 
of $\tau_{r+s}$ cannot change under variation of the values $x_{r+1}, \ldots, x_k$. Thus, $\tau_j$ is essentially 
dependent on $x_i$ for i and j satisfying $1 \le i \le k$ and $i \le j \le i + k$. (See Fig. 9.10.) Let 
$e(j)$ denote the number of variables in x on which $\tau_j$ depends; then $e(j) = j$ for $j \le k$ 
and $e(j) = 2k - j + 1$ for $j > k$. 

Figure 9.10 Let $f_{\text{merge}}^{(n)}(x, y) = (\tau_1, \ldots, \tau_n)$, where x and y are (n/2)-tuples. The dots in 
the jth row show the inputs on which $\tau_j$ depends. $e(j)$ is the number of dots in the jth row. 
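
The dependence pattern just described can be confirmed by brute force for small k. The sketch below, with illustrative names, enumerates sorted inputs, flips one component of x in the only way that keeps x sorted, and records which outputs change.

```python
def sorted_tuple(ones, k):
    """The descending binary k-tuple with the given number of leading 1s."""
    return (1,) * ones + (0,) * (k - ones)


def merge_outputs(x, y):
    """(tau_1, ..., tau_n) for the merge of sorted tuples x and y."""
    total = sum(x) + sum(y)
    n = len(x) + len(y)
    return tuple(1 if total >= t else 0 for t in range(1, n + 1))


def dependence(k):
    """deps[j] = set of i such that tau_j essentially depends on x_i."""
    deps = {j: set() for j in range(1, 2 * k + 1)}
    for a in range(1, k + 1):
        # x_a is the last 1 in x_hi; clearing it leaves x_lo, still sorted.
        x_hi, x_lo = sorted_tuple(a, k), sorted_tuple(a - 1, k)
        for b in range(k + 1):
            y = sorted_tuple(b, k)
            hi, lo = merge_outputs(x_hi, y), merge_outputs(x_lo, y)
            for j in range(1, 2 * k + 1):
                if hi[j - 1] != lo[j - 1]:
                    deps[j].add(a)
    return deps


def e(j, k):
    """Number of x-variables on which tau_j depends, as stated in the text."""
    return j if j <= k else 2 * k - j + 1
```

For k = 4 the counts are 1, 2, 3, 4, 4, 3, 2, 1 for j = 1, ..., 8, matching the two-case formula for $e(j)$.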

We show by induction that there exist vertex-disjoint paths between $x_1$ and $\tau_{s+1}$, $x_2$ 
and $\tau_{s+2}$, ..., $x_k$ and $\tau_{s+k}$ for $0 \le s \le k$. (See Fig. 9.11.) Thus, there are $k + 1$ sets of 
vertex-disjoint paths connecting the $k = n/2$ inputs in x and k consecutive outputs. 






Figure 9.11 (a) In a monotone circuit for $f_{\text{merge}}^{(n)}$, $n = 2k$, $k + 1$ sets of k disjoint paths exist 
between the k inputs x and k consecutive outputs. (b) The paths to an output $\tau_j$ form a binary tree. 

To show the existence of the vertex-disjoint paths, let $y_1 = y_2 = \cdots = y_s = 1$, 
$y_{s+1} = \cdots = y_k = 0$ and $x_1 = x_2 = \cdots = x_{r-1} = 1$, but let $x_r, x_{r+1}, \ldots, x_k$ be 
unspecified. Then $\tau_{r+s} = x_r$ and, as stated above, there is a path $P_r$ of gates from an 
input labeled $x_r$ to the output labeled $\tau_{r+s}$ such that each gate has value $x_r$. Set $x_r = 1$. 
Reasoning as before, there must be a path $P_{r+1}$ of gates from an input labeled $x_{r+1}$ to 
the output labeled $\tau_{r+1+s}$ such that each gate has value $x_{r+1}$. Every gate on $P_r$ now has 
the constant value 1, whereas every gate on $P_{r+1}$ computes $x_{r+1}$. Thus, $P_{r+1}$ and $P_r$ 
are vertex-disjoint. Extending this idea, we have the desired conclusion about disjoint paths. 

We now develop a second fact about these paths that is needed in the lower bound. Let 
$P_r$ be a path from $x_r$ to $\tau_{r+s}$, as suggested in Fig. 9.11(a). Those paths connecting 
inputs to any one output form a binary tree, as suggested in Fig. 9.11(b). The number of 
inputs from which there is a path to $\tau_j$ is $e(j)$, the number of inputs on which $\tau_j$ depends. 

To derive the lower bound on $C_{\Omega_{\text{mon}}}(f_{\text{merge}}^{(n)})$, let $d(i, j)$ denote the length (the number 
of edges, which equals the number of gates) of the shortest path from an input labeled $x_i$ to the 
output labeled $\tau_j$. (Clearly, $d(i, j) = 0$ unless $i \le j \le i + k$.) Since the path from 
input $x_i$ to output $\tau_j$ described above has a length at least as large as $d(i, j)$, it follows that 
$C_{\Omega_{\text{mon}}}(f_{\text{merge}}^{(n)})$ satisfies the following bound: 

$$C_{\Omega_{\text{mon}}}(f_{\text{merge}}^{(n)}) \ge \max_{0 \le s \le k} \sum_{r=1}^{k} d(r, r + s)$$

Since the maximum of a set of integers is at least equal to the average of these integers, we 
have the following for $k = n/2 \ge 1$: 

$$C_{\Omega_{\text{mon}}}(f_{\text{merge}}^{(n)}) \ge \frac{1}{k+1} \sum_{s=0}^{k} \sum_{r=1}^{k} d(r, r + s) = \frac{1}{k+1} \sum_{j=1}^{2k} \sum_{i=1}^{k} d(i, j)$$

The last identity follows by using the fact that $d(i, j) = 0$ unless $i \le j \le i + k$. But 
$\sum_{i=1}^{k} d(i, j)$ is the sum of the distances of the shortest paths from the relevant inputs of x to 
output $\tau_j$, $1 \le j \le 2k$. Since these paths form a binary tree and $\tau_j$ depends on $e(j)$ inputs, 
this is the external path length of a tree with $e(j)$ leaves. The external path length is at least 
$e(j) \lceil \log_2 e(j) \rceil - 2^{\lceil \log_2 e(j) \rceil} + e(j)$ (see Problem 9.4). In turn, 
$x \lceil \log_2 x \rceil - 2^{\lceil \log_2 x \rceil} + x \ge x \log_2 x$, because $\lceil \log_2 x \rceil = (\log_2 x) + \delta$ for 
$0 \le \delta < 1$ and $x \lceil \log_2 x \rceil - 2^{\lceil \log_2 x \rceil} + x = x \log_2 x + x(1 - 2^{\delta} + \delta)$, where 
$1 - 2^{\delta} + \delta$ is easily shown to be a concave function whose minimum value occurs at either 
$\delta = 0$ or $\delta = 1$, both of which are 0. Thus, $1 - 2^{\delta} + \delta \ge 0$ 
and the result follows. Thus, the size of the smallest monotone circuit satisfies the following 
lower bound when $n = 2k$: 

$$C_{\Omega_{\text{mon}}}(f_{\text{merge}}^{(n)}) \ge \frac{1}{k+1} \sum_{j=1}^{2k} e(j) \log_2 e(j) = \frac{2}{k+1} \sum_{j=1}^{k} j \log_2 j$$

The last equality uses the definition of $e(j)$ given above. By applying the reasoning in 
Problem 2.1 and captured in Fig. 2.23, it is easy to show that the above sum is at least as 
large as $(2/(k+1))(\log_2 e) \int_{1}^{k} y \log_e y \, dy$, whose value is 
$(2/(k+1))[(k^2/2) \log_2 k - (1/4) k^2 (\log_2 e) + (1/4) \log_2 e]$. From this the desired 
conclusion follows, since $k = n/2$. ■ 
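
The two inequalities used at the end of this proof can be sanity-checked numerically. The sketch below is only an illustration under the definitions in the proof; the function names are not from the text, and the integral is evaluated directly from the antiderivative $(y^2/2)\ln y - y^2/4$.

```python
from math import e as EULER, log2


def ceil_log2(n):
    """Exact ceil(log2 n) for integers n >= 1."""
    return (n - 1).bit_length()


def external_path_bound(x):
    """x*ceil(log2 x) - 2^ceil(log2 x) + x, the bound cited from Problem 9.4."""
    return x * ceil_log2(x) - 2 ** ceil_log2(x) + x


def merge_lower_bound(k):
    """(2/(k+1)) * sum_{j=1}^{k} j*log2(j)."""
    return 2 / (k + 1) * sum(j * log2(j) for j in range(1, k + 1))


def integral_bound(k):
    """(2/(k+1)) * log2(e) * integral from 1 to k of y*ln(y) dy."""
    log2_e = log2(EULER)
    return 2 / (k + 1) * (k * k / 2 * log2(k) - k * k / 4 * log2_e + log2_e / 4)


path_bound_ok = all(
    external_path_bound(x) >= x * log2(x) - 1e-9 for x in range(1, 1000)
)
integral_ok = all(
    merge_lower_bound(k) >= integral_bound(k) - 1e-9 for k in range(1, 300)
)
```

Since the leading term of the integral bound is $k \log_2 k$ and $k = n/2$, the bound grows as $(n/2)\log_2 n$ up to a term linear in n.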

We now present lower bounds on the monotone circuit size of Boolean convolution and 
Boolean matrix multiplication, problems for which the gap between the monotone and non- 
monotone circuit size is much larger than for sorting and merging. 

9.6.2 The Function Replacement Method 

The function replacement method simplifies monotone circuits by replacing a function com- 
puted at an internal vertex by a new function without changing the function computed by the 
overall circuit. Since a replacement step eliminates gates and reduces a problem to a subprob- 
lem, the method provides a basis for establishing lower bounds on circuit complexity using 
proof by induction. 

We describe two replacement rules and then apply them to Boolean convolution and 
Boolean matrix multiplication. These two problems are defined in the usual way except that 
variables assume Boolean values in B and the multiplication and addition operators are inter- 
preted as AND and OR, respectively. 

REPLACEMENT RULES A replacement rule is a rule that allows a function computed at a vertex 
of a circuit to be replaced by another without changing the function computed by the circuit. 
Before stating such rules for monotone functions, we introduce some terminology. 

DEFINITION 9.6.1 Let x denote the variables of a Boolean function $f : \mathcal{B}^n \mapsto \mathcal{B}$. An implicant 
of f is a product (AND), $\pi$, of a subset of the literals of f (the variables and their complements) 
such that if $\pi(x) = 1$ on input n-tuple x, then $f(x) = 1$. (This is denoted $\pi \le f$.) The set of 
implicants of a function f is denoted $I(f)$. 

An implicant $\pi$ of a Boolean function f is a prime implicant if there is no implicant $\pi_1$ 
different from $\pi$ such that $\pi \le \pi_1 \le f$. The set of prime implicants of a function f is denoted 
$PI(f)$. 

A monotone implicant (also called a monom) of a monotone Boolean function $f : \mathcal{B}^n \mapsto \mathcal{B}$ 
is the product (AND) $\pi$ of uncomplemented variables of f such that if $\pi(x) = 1$ on input n-tuple 
x, then $f(x) = 1$. The empty monom has value 1. The set of monotone implicants of a 
function f is denoted $I_{\text{mon}}(f)$. 

A monotone implicant $\pi$ of a Boolean function f is a monotone prime implicant if there is 
no monotone implicant $\pi_1$ different from $\pi$ such that $\pi \le \pi_1 \le f$. The set of monotone prime 
implicants of a function f is denoted $PI_{\text{mon}}(f)$. 

The products in the sum-of-products expansion (SOPE) are (non-monotone) implicants 
of a Boolean function. If a function is monotone, it has monotone implicants (monoms). The 
prime implicants of a Boolean function f define it completely; the OR of its prime implicants 
is a formula representing it. In the case of a monotone Boolean function, the prime implicants 
are monotone prime implicants. (See Problem 9.33.) 

When it is understood from context that an implicant or prime implicant is monotone, 
we may omit the word "monotone" and the subscript "mon." This will be the case in this 
section. 

The function $c_{j+1} = (p_j \wedge c_j) \vee g_j$ used in the design of a full adder (see Section 2.7) 
is a monotone function of the variables $p_j$, $c_j$, and $g_j$. Its set of implicants is $I(c_{j+1}) = 
\{p_j \wedge c_j,\ g_j,\ p_j \wedge g_j,\ c_j \wedge g_j,\ p_j \wedge c_j \wedge g_j\}$. If any one of these products has value 1 then so 
does $c_{j+1}$. Its set of prime implicants is $PI(c_{j+1}) = \{p_j \wedge c_j,\ g_j\} \subset I(c_{j+1})$ because these 
are the smallest products for which $c_{j+1}$ has value 1. Thus, $c_{j+1}$ is defined by $PI(c_{j+1})$ and 
represented as $c_{j+1} = (p_j \wedge c_j) \vee g_j$. 
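
The implicant sets of the carry function can be recomputed by brute force. The sketch below represents a monom as a set of variable names (an illustrative encoding, not notation from the text) and considers only non-empty monoms.

```python
from itertools import combinations, product

VARS = ("p", "c", "g")


def carry(p, c, g):
    """c_{j+1} = (p_j AND c_j) OR g_j."""
    return (p & c) | g


def is_implicant(monom):
    """True if carry = 1 on every input where all variables of monom are 1."""
    for values in product((0, 1), repeat=len(VARS)):
        env = dict(zip(VARS, values))
        if all(env[v] for v in monom) and not carry(**env):
            return False
    return True


# All non-empty monoms over the three variables.
monoms = [
    frozenset(s) for r in range(1, len(VARS) + 1) for s in combinations(VARS, r)
]
implicants = {m for m in monoms if is_implicant(m)}

# A monom is prime if no proper sub-monom is also an implicant.
prime_implicants = {m for m in implicants if not any(o < m for o in implicants)}
```

This yields five implicants and the two prime implicants $p_j \wedge c_j$ and $g_j$, in agreement with the sets listed above.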

We now present a replacement rule for monotone functions that captures the following 
idea: if a function g computed by a gate of a monotone circuit has a monom $\pi$ that is not a 
monom of the function f computed by the complete circuit, then it can be removed from g 
without affecting the value of f. This idea is valid in monotone circuits because the absence 
of negation provides only one way to eliminate extra monoms, namely, by ORing them with 
products containing a subset of their variables. Taking the AND of a monom with another 
term creates a longer monom. Thus, since monoms that are not monoms of the function f 
computed by a circuit must be eliminated, there is no loss of generality in assuming that they 
are not produced in the first place. 

DEFINITION 9.6.2 Let $f : \mathcal{B}^n \mapsto \mathcal{B}$ and $g : \mathcal{B}^n \mapsto \mathcal{B}$ be two monotone functions. Let g be 
computed within a monotone circuit for f. The following is a replacement rule for g: 

a) Let $\pi \in PI(g)$ and let h be defined by $PI(h) = PI(g) - \{\pi\}$. Replace the gate computing 
g by one computing h if for all monoms $\pi'$ (including the empty monom), $\pi \wedge \pi' \notin PI(f)$. 

We now show that any monom $\pi$ satisfying Rule (a) can be removed from $PI(g)$ because 
it contributes nothing to the computation of f. 

LEMMA 9.6.1 Let $f : \mathcal{B}^n \mapsto \mathcal{B}$ and $g : \mathcal{B}^n \mapsto \mathcal{B}$ be two monotone functions and let $\pi \in PI(g)$ 
be such that for all monoms $\pi'$ (including the empty monom), $\pi \wedge \pi' \notin PI(f)$. Let h be defined 
by $PI(h) = PI(g) - \{\pi\}$. If g is computed in some monotone circuit for f, the circuit obtained 
by replacing g by h also computes f. 

Proof Let C denote a circuit for f within which the function g is computed. Let $C^*$ be 
the circuit obtained by replacing g by h under Rule (a). Since $h \le g$ and the circuit is 
monotone, the function $f^*$ computed by $C^*$ satisfies $f^* \le f$. We suppose that $f^* \ne f$ and 
show that a contradiction results. 

If $f^* \ne f$, there is some input n-tuple $a \in \mathcal{B}^n$ such that $f^*(a) = 0$ but $f(a) = 1$. 
Since the only change in the circuit occurred at the gate computing g, by monotonicity, on 
this tuple $g(a) = 1$ but $h(a) = 0$. It follows that $\pi(a) = 1$. Let $\pi'$ be a prime implicant of 
f for which $\pi'(a) = 1$. We show that $\pi' = \pi \wedge \pi_1$ for some monom $\pi_1$, in contradiction 
to the condition of the lemma. 

Let $x_i$ be any variable of $\pi$. Then $a_i = 1$ since $\pi(a) = 1$. Define the n-tuple b by 
$b_i = 0$ and $b_j = a_j$ for $j \ne i$. Since $b \le a$ and $\pi(b) = 0$, h and g both have the same 
value on b. Thus, both circuits compute the same value, which must be 0 by monotonicity 
and the fact that $f^* = 0$ on a. Since $\pi'(a) = 1$ and $\pi'(b) = 0$ but only one variable was 
changed, namely $x_i$, $\pi'$ must contain $x_i$. Since $x_i$ is an arbitrary variable of $\pi$, it follows 
that $\pi'$ contains $\pi$ as a sub-monom. ■ 
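
A toy instance may help fix the idea of Rule (a). The circuit below is hypothetical and chosen only for illustration: its output is $g \wedge x$ with $g = x \vee z$, so the circuit computes $f = x$ and $PI(f) = \{x\}$. The monom z of g cannot be extended to a prime implicant of f, so Rule (a) allows g to be replaced by h with $PI(h) = \{x\}$.

```python
from itertools import product


def circuit(x, z, gate):
    """Hypothetical monotone circuit: output = gate(x, z) AND x."""
    return gate(x, z) & x


def g(x, z):
    """Original gate function, PI(g) = {x, z}."""
    return x | z


def h(x, z):
    """Replacement gate function, PI(h) = PI(g) - {z} = {x}."""
    return x


# The circuit computes the same function before and after the replacement.
same_function = all(
    circuit(x, z, g) == circuit(x, z, h) for x, z in product((0, 1), repeat=2)
)
```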

This last result implies that if a function f has no prime implicants containing more than 
l variables, then any monoms containing more than l variables can be removed where they 
are first created. This will be useful later when discussing Boolean convolution and Boolean 
matrix multiplication, since each of their prime implicants depends on two variables. 

BOOLEAN CONVOLUTION Convolution over commutative rings is defined in Section 6.7. In 
this section we introduce the Boolean version, which is defined by a monotone multiple-output 
function, and derive a lower bound of $n^{3/2}$ on its monotone circuit size. We also show that 
over a complete basis Boolean convolution can be realized by a circuit of nearly linear size. 

DEFINITION 9.6.3 The Boolean convolution function $f_{\text{conv}}^{(n)} : \mathcal{B}^{2n} \mapsto \mathcal{B}^{2n-1}$ maps Boolean 
n-tuples $a = (a_0, a_1, \ldots, a_{n-1})$ and $b = (b_0, b_1, \ldots, b_{n-1})$ onto a $(2n-1)$-tuple c, denoted 
$c = a \otimes b$, where $c_j$, $0 \le j \le 2n - 2$, is defined as 

$$c_j = \sum_{r+s=j} a_r \wedge b_s$$

Here the sum denotes OR. 

Boolean convolution can be realized by a circuit over the standard basis $\Omega_0$ for multiplying 
binary numbers (see Section 2.9) as follows. Represent a and b by the following integers where 
$q = \lceil \log_2 n \rceil + 1$: 

$$a = \sum_{i=0}^{n-1} a_i 2^{qi}, \qquad b = \sum_{j=0}^{n-1} b_j 2^{qj}$$

That is, each bit in a and b is separated by $\lceil \log_2 n \rceil$ zeros. The formal product of a and b is 

$$ab = \sum_{k=0}^{2n-2} \Bigl( \sum_{i+j=k} a_i b_j \Bigr) 2^{qk}$$

Because no inner sum in the above expression is more than $2n - 1$, at most q bits suffice to 
represent it in binary notation. Consequently, there is no carry between any two inner sums. 
It follows that an inner sum is non-zero if and only if $c_k = 1$. Thus, the value of $c_k$ can be 
obtained by forming the OR of the bits in positions $kq, kq + 1, \ldots, kq + q - 1$ of the product. 
Since two binary m-tuples can be multiplied in the standard binary notation by a circuit of 
size $O(m(\log m)(\log\log m))$ (see Section 2.9.3), the function $f_{\text{conv}}^{(n)}$ can be computed by a 
circuit of size $O(n(\log^2 n)(\log\log n))$ since $m = nq = O(n \log n)$. 
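
The reduction from Boolean convolution to integer multiplication described above can be sketched as follows. Python integers stand in for the binary multiplier circuit, and the function names are illustrative.

```python
from itertools import product


def ceil_log2(n):
    """Exact ceil(log2 n) for integers n >= 1."""
    return (n - 1).bit_length()


def conv_direct(a, b):
    """c_j = OR over r + s = j of (a_r AND b_s), for 0 <= j <= 2n - 2."""
    n = len(a)
    return tuple(
        int(any(a[r] & b[j - r] for r in range(n) if 0 <= j - r < n))
        for j in range(2 * n - 1)
    )


def conv_by_multiplication(a, b):
    """Space the bits q positions apart, multiply, then OR each q-bit field."""
    n = len(a)
    q = ceil_log2(n) + 1
    big_a = sum(bit << (q * i) for i, bit in enumerate(a))
    big_b = sum(bit << (q * j) for j, bit in enumerate(b))
    prod = big_a * big_b
    mask = (1 << q) - 1
    # Field k holds the integer inner sum for index k; it is non-zero iff c_k = 1.
    return tuple(int(((prod >> (q * k)) & mask) != 0) for k in range(2 * n - 1))


methods_agree = all(
    conv_direct(a, b) == conv_by_multiplication(a, b)
    for a in product((0, 1), repeat=4)
    for b in product((0, 1), repeat=4)
)
```

The spacing q guarantees that each inner sum, which is at most n, fits in its own q-bit field, so no carry crosses a field boundary.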

THEOREM 9.6.3 The circuit size of $f_{\text{conv}}^{(n)} : \mathcal{B}^{2n} \mapsto \mathcal{B}^{2n-1}$ over the standard basis satisfies 

$$C_{\Omega_0}(f_{\text{conv}}^{(n)}) = O(n(\log^2 n)(\log\log n))$$

Our goal is to use the function replacement method to show that every monotone circuit 
for Boolean convolution has size $\Omega(n^{3/2})$. As explained above, the method is designed to 
use induction to prove lower bounds on monotone circuit size. Each replacement step removes 
prime implicants from the function g computed at some gate and changes the function f 
computed by the circuit. If the new function $f^*$ is in the same family as f, the gate-replacement 
process can continue and induction can be applied. Since the convolution function does not 
necessarily change to another instance of itself on fewer variables, we place this function in the 
class of semi-disjoint bilinear forms. 

DEFINITION 9.6.4 Let $f^{(n,m,p)} = (f_1, f_2, \ldots, f_p)$, where each $f_r : \mathcal{B}^{n+m} \mapsto \mathcal{B}$, $1 \le r \le p$, 
is a monotone function on n-tuple x and m-tuple y; that is, $f_r(x, y) \in \mathcal{B}$. $f^{(n,m,p)}$ is a bilinear 
form if each prime implicant of each $f_r$, $1 \le r \le p$, contains one variable of x and one of y. 
A function $f^{(n,m,p)}$ is a semi-disjoint bilinear form if in addition $PI(f_r) \cap PI(f_s) = \emptyset$ for 
$r \ne s$ and each variable is contained in at most one prime implicant of any one function. 

Before deriving a lower bound on the number of gates needed for a semi-disjoint bilinear 
form, we introduce a new replacement rule peculiar to these forms. 

LEMMA 9.6.2 No gate of a monotone circuit of minimal size for a semi-disjoint bilinear form 
$f^{(n,m,p)}$ computes a function g whose prime implicants include either two variables of x or two of y. 

Proof We suppose that a minimal monotone circuit does contain a gate g whose prime 
implicants contain either two variables of x or two of y and show that a contradiction 
results. Without loss of generality, assume that $PI(g)$ contains $x_i$ and $x_j$, $i \ne j$. If there is 
a gate g satisfying this hypothesis, there is one that is closest to an input variable. This must 
be an OR gate because AND gates increase the length of prime implicants. Because the gate 
in question is closest to inputs, at least one of $x_i$ and $x_j$ is either an input to this OR gate or 
is the input to some OR gate that is on a path of OR gates to this gate. (See Fig. 9.12.) 

A simple proof by induction on its circuit size demonstrates that if a circuit for $f^{(n,m,p)}$ 
$= (f_1, \ldots, f_p)$ contains a gate computing g then $f_r$, $1 \le r \le p$, can be written as follows 
(see Problem 9.36): 

$$f_r(x, y) = (p_r(x, y) \wedge g(x, y)) \vee q_r(x, y) \qquad (9.1)$$

Here $p_r(x, y)$ and $q_r(x, y)$ are Boolean functions. Of course, if for no r is $f_r$ a function of 
g, then we can set $p_r(x, y) = 0$ and the circuit is not minimal. 

If $f_r$ depends on g, $p_r(x, y) \ne 0$. However, $p_r(x, y) \ne 1$ because otherwise both 
$x_i$ and $x_j$ are prime implicants of $f_r$, contradicting its definition. Also, $PI(p_r(x, y))$ 
cannot have any monoms containing one or more instances of a variable in x or two or 
more instances of variables in y because when ANDed with g they produce monoms that 
could be removed by Rule (a) of Definition 9.6.2 and the circuit would not be minimal. It 
follows that $PI(p_r(x, y))$ can contain only single variables of y. But this implies that for 
some k, $y_k \wedge g \in I(f_r)$, which together with the fact that $x_i, x_j \in PI(g)$ implies that 
$y_k \wedge x_i,\ y_k \wedge x_j \in I(f_r)$. But $y_k \wedge x_i$ and $y_k \wedge x_j$ cannot both be prime implicants of $f_r$ 
because they violate the requirement that no two prime implicants of $f_r$ contain the same 
variable. It follows that $f_r$ does not depend on g. ■ 

Figure 9.12 If $PI(g)$ for a gate g contains $x_i$ and $x_j$, then either $x_i$ or $x_j$ is input to an OR 
gate on a path of OR gates to g. 

The Boolean convolution function is a semi-disjoint bilinear form. Each prime implicant of 
each component of $c = a \otimes b$ contains one variable of a and one of b. In addition, the prime 
implicants of $c_i$ and $c_j$ are disjoint if $i \ne j$. Finally, each variable appears in only one prime 
implicant of a component function, although it may appear in more than one such function. 

THEOREM 9.6.4 Let $f^{(n,m,p)} : \mathcal{B}^{n+m} \mapsto \mathcal{B}^p$, $f^{(n,m,p)} = (f_1, f_2, \ldots, f_p)$, be a semi-disjoint 
bilinear form, where $f_r(x, y) \in \mathcal{B}$. Let $d_i$ be the number of functions in $\{f_1, f_2, \ldots, f_p\}$ that 
are essentially dependent on the input variable $x_i$, $1 \le i \le n$. Then the monotone circuit size of 
$f^{(n,m,p)}$ must satisfy the following lower bound: 

$$C_{\Omega_{\text{mon}}}(f^{(n,m,p)}) \ge \sum_{i=1}^{n} \sqrt{d_i}$$

Proof The proof is by induction. The basis for induction is the semi-disjoint bilinear form 
on two variables $f^{(1,1,1)}(x, y) = x \wedge y$. In this case $d_1 = 1$ and $C_{\Omega_{\text{mon}}}(f^{(1,1,1)}) = 1$. 
We assume that any semi-disjoint bilinear form in $n + m - 1$ or fewer variables satisfies the 
lower bound. We show that setting $x_i = 0$ produces another function that is a semi-disjoint 
bilinear form and allows the removal of at least $\sqrt{d_i}$ gates. The lower bound follows by 
induction. We consider only minimal circuits. 

Let $u_i$ denote the number of functions in $\{f_1, f_2, \ldots, f_p\}$ that are essentially dependent 
on $x_i$ and have a single prime implicant (such as $c_0 = a_0 \wedge b_0$ and $c_{2n-2} = a_{n-1} \wedge b_{n-1}$ 
for convolution). Setting $x_i = 0$ eliminates the $u_i$ AND gates at which these outputs 
are computed. We show that at least $\sqrt{d_i - u_i}$ OR gates can also be eliminated. Since 
$u_i + \sqrt{d_i - u_i} \ge \sqrt{d_i}$ (see Problem 9.8), we have the desired conclusion. 

Let $V_i$ denote those outputs that depend on $x_i$ whose associated function has at least 
two prime implicants. Then $|V_i| = d_i - u_i$. There must be at least one OR gate on each 
path P from $x_i$ to $f_r \in V_i$ because, if not, each path contains only ANDs and $f_r$ has only 
one prime implicant that contains $x_i$, in contradiction to the definition of $V_i$. 

We claim that on each path P from an input labeled $x_i$ to some $f_r \in V_i$ there is an 
OR gate computing a function $g_t$ such that $x_{i_t} \wedge y_{j_t} \in PI(g_t)$ for some $x_{i_t} \ne x_i$. Let 
$E_i = \{g_t\}$ be those OR gates closest to an input vertex $x_i$. Call $E_i$ the bottleneck for 
variable $x_i$. We shall show that $|E_i| \ge \sqrt{d_i - u_i}$ and that each of the gates in $E_i$ can be 
eliminated by setting $x_i = 0$. 

If the claim is false, then there is a path P from input $x_i$ to output $f_r \in V_i$ such that 
for each OR gate (let it compute $g_t$) on P there is no $x_{i_t} \ne x_i$ such that $x_{i_t} \wedge y_{j_t} \in 
PI(g_t)$. Therefore, either all monoms of $PI(g_t)$ a) contain $x_i$ or b) are monoms that are 
not implicants of an output (they are not of the form $x_{i_t} \wedge y_{j_t}$). In case a), setting $x_i = 0$ 
causes the OR gates on P to have value 0, which forces the AND gates on P and $f_r$ to 
have value 0, contradicting the definition of $f_r$ (it has at least two prime implicants). In the 
second case under Rule (a) the monoms not containing $x_i$ can be removed without changing 
the functions computed. Thus, when $x_i = 0$, the output of each OR gate on P has value 0, 
which contradicts the definition of $f_r$ since it contains at least two prime implicants. 

We now show that $|E_i| \ge \sqrt{d_i - u_i}$. Since each of the OR gates in $E_i$ has a prime 
implicant $x_{i_t} \wedge y_{j_t}$ not containing $x_i$, their outputs can be set to 1 by setting $x_{i_t} = y_{j_t} = 1$ 
for $1 \le t \le |E_i|$. This eliminates all dependence of $f_r \in V_i$ on $x_i$. However, since inputs 
have only been assigned value 1 (and not 0), this dependence on $x_i$ can be eliminated only 
if all functions in $V_i$ have value 1; that is, at least one prime implicant of each of them is 
set to 1 by this assignment. Since each variable appears in at most one prime implicant of 
a function, the number of different variables $x_{i_t}$ (and $y_{j_t}$) that are set to 1 is at most $|E_i|$. 
Thus, at most $|E_i|^2$ prime implicants can be assigned value 1 by this assignment. Thus, if 
$|E_i|^2 < (d_i - u_i)$, we have a contradiction since $|V_i| = (d_i - u_i)$. 

We now show that $|E_i|$ OR gates can be eliminated by setting $x_i = 0$. Since each gate is 
a closest gate to an input labeled $x_i$ with the stated property, there is an OR gate on the path 
to it with $x_i$ as an input. Thus, setting $x_i = 0$ eliminates one of the two inputs to the OR 
gate and the need for the gate itself. ■ 

Since for each of the n input variables in a there are n output functions in $c = a \otimes b$ that 
depend on it ($d_i = n$ for $1 \le i \le n$), the following corollary is immediate. 



COROLLARY 9.6.1 Let $f_{\text{conv}}^{(n)} : \mathcal{B}^{2n} \mapsto \mathcal{B}^{2n-1}$ be the Boolean convolution function. Then the 
monotone circuit size of $f_{\text{conv}}^{(n)}$ satisfies the following lower bound: 

$$C_{\Omega_{\text{mon}}}(f_{\text{conv}}^{(n)}) \ge n^{3/2}$$
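
The step from Theorem 9.6.4 to the corollary can be reproduced numerically. The sketch below counts, for each variable $a_i$, the convolution outputs that depend on it, sums the square roots, and also spot-checks the inequality cited from Problem 9.8 over a range of integers. The function names are illustrative.

```python
from math import sqrt


def dependence_counts(n):
    """d_i = number of outputs c_j, 0 <= j <= 2n - 2, that depend on a_i."""
    return [
        sum(1 for j in range(2 * n - 1) if 0 <= j - i < n) for i in range(n)
    ]


def lower_bound(n):
    """Sum over the a-variables of sqrt(d_i), as in Theorem 9.6.4."""
    return sum(sqrt(d) for d in dependence_counts(n))


def problem_9_8_holds(d_max):
    """Check u + sqrt(d - u) >= sqrt(d) for integers 0 <= u <= d <= d_max."""
    return all(
        u + sqrt(d - u) >= sqrt(d) - 1e-12
        for d in range(1, d_max + 1)
        for u in range(d + 1)
    )
```

For n = 16 every $d_i$ equals 16, so the bound evaluates to $16 \cdot 4 = 64 = 16^{3/2}$.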



Unfortunately, no upper bound on the monotone circuit size of $f_{\text{conv}}^{(n)}$ is known that 
matches this lower bound. A stronger statement can be made for Boolean matrix 
multiplication. 

BOOLEAN MATRIX MULTIPLICATION Matrix multiplication over rings is discussed at length in 
Section 6.3. In this section we introduce the Boolean version. An $I \times J$ matrix $A = [a_{i,j}]$, 
$1 \le i \le I$ and $1 \le j \le J$, is a two-dimensional array of elements in which $a_{i,j}$ is the element 
in the ith row and jth column. We take the entries in a matrix to be Boolean variables. 

DEFINITION 9.6.5 Let $A = [a_{i,k}]$, $1 \le i \le n$ and $1 \le k \le m$, $B = [b_{k,j}]$, $1 \le k \le m$ and 
$1 \le j \le p$, and $C = [c_{i,j}]$, $1 \le i \le n$ and $1 \le j \le p$, be $n \times m$, $m \times p$, and $n \times p$ matrices, 
respectively. The product $C = A \times B$ of A and B is the function $f_{MM}^{(n,m,p)} : \mathcal{B}^{nm+mp} \mapsto \mathcal{B}^{np}$ 
whose value on the matrices A and B is the matrix C whose entry in row i and column j, $c_{i,j}$, is 
defined as 

$$c_{i,j} = \bigvee_{k=1}^{m} a_{i,k} \wedge b_{k,j}$$

In a more general context the AND operator $\wedge$ and the OR operator $\vee$ are replaced by the 
multiplication and addition operators over rings. 

The above definition can be used as an algorithm to compute $c_{i,j}$, $1 \le i \le n$ and 
$1 \le j \le p$, from the entries in matrices A and B. We call this the standard matrix-multiplication 
algorithm. It uses $nmp$ ANDs and $n(m-1)p$ ORs. We now show that every monotone circuit 
for matrix multiplication requires at least this many ANDs and ORs. 
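
The standard algorithm and its gate counts can be written out directly. This is a minimal sketch with an illustrative function name; it counts one AND per product term and one OR per term after the first in each output.

```python
def boolean_matmul(A, B):
    """Standard Boolean matrix product; returns (C, number of ANDs, number of ORs)."""
    n, m, p = len(A), len(B), len(B[0])
    ands = ors = 0
    C = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            acc = A[i][0] & B[0][j]
            ands += 1
            for k in range(1, m):
                term = A[i][k] & B[k][j]
                ands += 1
                acc |= term
                ors += 1
            C[i][j] = acc
    return C, ands, ors


# Example with n = 2, m = 3, p = 2.
A = [[1, 0, 1],
     [0, 0, 1]]
B = [[0, 1],
     [1, 1],
     [0, 0]]
C, and_count, or_count = boolean_matmul(A, B)
```

In this example the counts are $nmp = 12$ ANDs and $n(m-1)p = 8$ ORs.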

Clearly the matrix multiplication function is a bilinear form. We associate the entries in 
A with the tuple x and those in B with y. We strengthen Theorem 9.6.4 to obtain a lower 
bound on the number of ORs needed to realize it in a monotone circuit. 

LEMMA 9.6.3 Every monotone circuit for Boolean matrix multiplication $f_{MM}^{(n,m,p)}$ requires at least 
$n(m-1)p$ OR gates. 

Proof In the proof of Theorem 9.6.4 we identified a set $E_i$ of gates called the bottleneck 
associated with each input variable $x_i$. We demonstrated that each of these gates can be 
eliminated by setting $x_i = 0$ and that $E_i$ has at least $\sqrt{d_i - u_i}$ gates, where $d_i - u_i = |V_i|$ 
is the number of circuit outputs that depend essentially on $x_i$ and have at least two prime 
implicants. These results were shown by proving that all gates in $E_i$ are OR gates and that 
the tth of these gates' associated function contains a prime implicant of the form $x_{i_t} \wedge y_{j_t}$ 
for $x_{i_t} \ne x_i$. We then demonstrated that the dependence of the outputs in $V_i$ on the input 
$x_i$ can be eliminated by setting $x_{i_t} = y_{j_t} = 1$ for $1 \le t \le |E_i|$ but that this contradicts 
the definition of a semi-disjoint bilinear form if $|E_i|^2 < |V_i|$. Finally, we proved that by 
setting $x_i = 0$ each of the gates in $E_i$ could be eliminated. For this lemma, we need only 
strengthen the lower bound on $|E_i|$ for matrix multiplication. 

Consider a minimal circuit. The proof is by induction on m, with the base case being 
$m = 1$. In the base case $c_{i,j} = a_{i,1} \wedge b_{1,j}$ for $1 \le i \le n$ and $1 \le j \le p$ and no ORs 
are needed. As inductive hypothesis we assume that $f_{MM}^{(n,m-1,p)}$ requires at least $n(m-2)p$ 
OR gates. We show that setting any column of A in $f_{MM}^{(n,m,p)}$ to 0 eliminates $np$ OR gates 
and reduces the problem to an instance of $f_{MM}^{(n,m-1,p)}$. It follows that $f_{MM}^{(n,m,p)}$ requires 
$n(m-1)p$ OR gates. 

When $m \ge 2$, each output function $c_{i,j}$ has at least two prime implicants. We apply 
the bottleneck argument to this case. Consider the bottleneck $E_{i,k}$ associated with input 
variable $a_{i,k}$. We show that $|E_{i,k}| \ge p$, from which it follows that at least p OR gates can be 
eliminated by setting $a_{i,k} = 0$. This reduces the problem to another set of bilinear forms. 
Repeating this for $1 \le i \le n$, we eliminate $np$ OR gates, one column of A, and one row of 
B. Let $V_{i,k} = \{c_{i,j} \mid 1 \le j \le p\}$ be the outputs that depend on $a_{i,k}$. 

To show that $|E_{i,k}| \ge p$, let the tth gate of $E_{i,k}$ compute $x_{i_t} \wedge y_{j_t}$ for $x_{i_t} \ne a_{i,k}$. 
Here $x_{i_t} = a_{i_t,k_t}$ and $y_{j_t} = b_{l_t,j_t}$ for some $i_t$, $k_t$, $l_t$, and $j_t$. If we set all entries in 
$\{a_{i_t,k_t} \mid 1 \le t \le |E_{i,k}|\} \cup \{b_{l_t,j_t} \mid 1 \le t \le |E_{i,k}|\}$ to 1, we eliminate all dependence of 
outputs in $V_{i,k}$ on $a_{i,k}$. However, since $|V_{i,k}| = p$, the set $\{b_{l_t,j_t}\}$ must contain at least one 
variable used in $c_{i,j}$ for each $1 \le j \le p$. Thus, $|E_{i,k}| \ge p$. ■ 

We now derive a lower bound on the number of AND gates needed for Boolean matrix 
multiplication. 

LEMMA 9.6.4 Every monotone circuit for Boolean matrix multiplication $f_{MM}^{(n,m,p)}$ requires at least 
$nmp$ AND gates. 

Proof Consider a minimal circuit. The proof is by induction on m, the base case being 
$m = 1$. In the base case $c_{i,j} = a_{i,1} \wedge b_{1,j}$ for $1 \le i \le n$ and $1 \le j \le p$ and $np$ ANDs are 
needed, since $np$ results must be computed, each requiring one AND, and all functions are 
different. As inductive hypothesis we assume that $f_{MM}^{(n,m-1,p)}$ requires at least $n(m-1)p$ 
AND gates. We show that setting any column of A in $f_{MM}^{(n,m,p)}$ to 1 and the corresponding 
row of B to 0 eliminates $np$ AND gates and reduces the problem to an instance of $f_{MM}^{(n,m-1,p)}$. 
It follows that $f_{MM}^{(n,m,p)}$ requires $nmp$ AND gates. 

For arbitrary $1 \le k \le m$ let $G_{i,j}$ be a gate closest to inputs computing a function g 
such that $PI(g)$ contains $a_{i,k} \wedge b_{k,j}$. Since the gate associated with $c_{i,j}$ has $a_{i,k} \wedge b_{k,j}$ as a 
prime implicant, there is such a gate $G_{i,j}$. Furthermore, $G_{i,j}$ must be an AND gate because 
OR gates cannot generate new prime implicants. Let $G_1$ and $G_2$ be gates generating inputs 
for $G_{i,j}$. Let them compute functions $g_1$ and $g_2$. It follows from the definition of $G_{i,j}$ that 
$a_{i,k} \in PI(g_1)$ and $b_{k,j} \in PI(g_2)$ or vice versa. Let the former hold. If $a_{i,k} = 1$, $g_1 = 1$ 
and $G_{i,j}$ can be eliminated. We now show that $G_{i,j} \ne G_{i',j'}$ for $(i, j) \ne (i', j')$. Suppose 
not. Since $i \ne i'$ or $j \ne j'$, there are at least three distinct variables among $a_{i,k}$, $a_{i',k}$, $b_{k,j}$, 
and $b_{k,j'}$. Therefore either $g_1$ or $g_2$ has at least two of these variables as prime implicants. 
By Lemma 9.6.2 this circuit is not minimal, a contradiction. ■ 

We summarize the results of this section below. 

THEOREM 9.6.5 The standard algorithm for $f_{MM}^{(n,m,p)} : \mathcal{B}^{nm+mp} \mapsto \mathcal{B}^{np}$, the Boolean matrix 
multiplication function, is optimal. It uses $nmp$ ANDs and $n(m-1)p$ ORs. 

We now show that the monotone circuit size of the clique function is exponential. 



9.6.3 The Approximation Method 

The approximation method is used to derive large lower bounds on the monotone circuit size
for certain monotone Boolean functions. In this section we use it to derive an exponential
lower bound on the size of the smallest monotone circuit for the clique function
$f_{\text{clique},k}^{(n)} : \mathcal{B}^{n(n-1)/2} \mapsto \mathcal{B}$. This method provides an interesting approach to deriving large lower bounds
on circuit size. However, as mentioned in the Chapter Notes, it is doubtful that it can be used
to obtain large lower bounds on circuit size over complete bases.

The approximation method converts a monotone circuit $C$ computing a function $f$ into
an approximation circuit $\hat{C}$ computing a function $\hat{f}$. This is done by repeatedly replacing a
previously unvisited gate farthest away from the output gate by an approximation gate that
computes an approximation to the AND or OR gate it replaces. Each replacement operation
changes the circuit and increases by a small amount the number of input tuples on which $f$
and the function computed by the new circuit differ. When the entire replacement process is
complete, the resulting circuit approximates $f$ poorly; that is, $f$ and $\hat{f}$ differ on a large number
of inputs. For this to happen, the original monotone circuit must have had many gates, each
of which contributes a relatively small number of errors to the complete replacement process.
This is the essence of the approximation method.

There are a number of ways to approximate AND and OR gates in a monotone circuit. 
Razborov [270], who introduced the approximation method, used an approximation for gates 
based on clique indicators, monotone functions associated with a subset of a set of vertices 
that has value 1 exactly when there is an edge between every pair of vertices in the subset. In 
this section gates are approximated in terms of the SOPE and POSE forms, a method used by 
Amano and Maruoka [20] to approximate the clique function. 

It is not hard to show that the monotone circuit size of $f_{\text{clique},k}^{(n)}$ is $O(n^k)$. (See
Problem 9.37.) We now show that all monotone circuits for $f_{\text{clique},k}^{(n)}$ have size

$$C_{\Omega_{\text{mon}}}\left(f_{\text{clique},k}^{(n)}\right) \ge \frac{1}{2}(1.8)^{\min\left(\sqrt{k-1}/2,\ n/(2k)\right)},$$

which is $2^{\Omega(n^{1/3})}$ for $k$ proportional to $n^{2/3}$.

TEST CASES The quality of an approximation to the clique function $f_{\text{clique},k}^{(n)}$ is determined
by providing positive and negative test inputs. A $k$-positive test input is a binary $n(n-1)/2$-
tuple that describes a graph containing a single $k$-clique.

The negative test inputs, defined below, describe graphs that have many edges but not
quite enough to contain a $k$-clique. A special set of negative test inputs is associated with
balanced partitions of the vertices of an $n$-vertex graph $G = (V, E)$. A $(k-1)$-balanced
partition of $V = \{v_1, \ldots, v_n\}$ is a collection of $k-1$ disjoint sets, $V_1, V_2, \ldots, V_{k-1}$, such
that each set contains either $\lceil n/(k-1) \rceil$ or $\lfloor n/(k-1) \rfloor$ elements. (By Problem 9.5 there are
$w = n \bmod (k-1)$ sets of the first kind and $k-1-w$ sets of the second kind.) The graph
associated with a particular $(k-1)$-balanced partition has an edge between each pair of vertices
in different sets and no other edges. Such a graph contains no $k$-clique, because any $k$ vertices
include two from the same set. For each $(k-1)$-balanced partition, a $k$-negative test
input is a binary $n(n-1)/2$-tuple $x$ describing the graph $G$ associated with that partition.

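LEMMA 9.6.5 There are $\tau_+$ $k$-positive test inputs, where

$$\tau_+ = \binom{n}{k} = \frac{n!}{k!\,(n-k)!}$$

and $\tau_-$ $k$-negative test inputs, where for $w = n \bmod (k-1)$

$$\tau_- = \frac{n!}{\left(\lceil n/(k-1) \rceil !\right)^{w} \left(\lfloor n/(k-1) \rfloor !\right)^{k-1-w}\, w!\,(k-1-w)!}$$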

Proof It is well known that $\tau_+ = \binom{n}{k}$. To derive the expression for $\tau_-$ we index each
element of each set in a $(k-1)$-balanced partition. Such a partition has $w = n \bmod (k-1)$
sets containing $\lceil n/(k-1) \rceil$ elements and $k-1-w$ sets containing $\lfloor n/(k-1) \rfloor$ elements.
The elements in the first $w$ sets are indexed by the pairs $\{(i,1), (i,2), \ldots, (i, \lceil n/(k-1) \rceil)\}$
for $1 \le i \le w$. Those in the remaining $k-1-w$ sets are indexed by the pairs
$\{(i,1), (i,2), \ldots, (i, \lfloor n/(k-1) \rfloor)\}$ for $w+1 \le i \le k-1$. (See Fig. 9.13.) Let $\mathcal{P}$
be the set of all such pairs. To define a $k$-negative graph, we assign each vertex in the set
$V = \{1, 2, \ldots, n\}$ to a unique pair. This partitions the vertices into $k-1$ sets. If vertices
$v_a$ and $v_b$ are in the same set, the edge variable $x_{a,b} = 0$; otherwise $x_{a,b} = 1$. These
assignments define the edges in a graph $G = (V, E)$. There are $n!$ assignments of vertices
to pairs. Of these, there are $(\lceil n/(k-1) \rceil !)^{w} (\lfloor n/(k-1) \rfloor !)^{k-1-w}\, w!\,(k-1-w)!$ that



correspond to each graph. To see this, observe that there are $\lceil n/(k-1) \rceil !$ ways to permute
the elements in each of the first $w$ sets and $\lfloor n/(k-1) \rfloor !$ ways to permute the elements in
each of the remaining $k-1-w$ sets. Also, each of the first $w$ (the last $k-1-w$) sets have
the same size and can be ordered in any of $w!$ ($(k-1-w)!$) ways without changing the
graph. ■

Pairs in $\mathcal{P}$               First assignment       Second assignment
(1,1) (1,2) (1,3) (1,4)     v3  v7  v1  v2         v2  v1  v3  v7
(2,1) (2,2) (2,3)           v9  v5  v10            v5  v10 v9
(3,1) (3,2) (3,3)           v6  v4  v8             v4  v8  v6

Figure 9.13 A set of pairs $\mathcal{P}$ indexing the elements of sets in a $(k-1)$-balanced partition of
a set $V$ of $n$ vertices. In this example $n = 10$ and $k = 4$ and the partition has three sets, $V_1$,
$V_2$, and $V_3$, containing four, three, and three elements, respectively. Shown are two assignments of
variables to pairs in $\mathcal{P}$ that correspond to the same partition of $V$.
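
The counts in Lemma 9.6.5 can be checked numerically for small parameters. The sketch below is an illustration only: the function names are hypothetical, and the brute-force enumeration is practical only for very small $n$. It computes $\tau_+$ and $\tau_-$ from the closed forms and compares $\tau_-$ with a direct count of $(k-1)$-balanced partitions.

```python
from itertools import product
from math import comb, factorial


def tau_plus(n, k):
    """Number of k-positive test inputs: one graph per k-subset of vertices."""
    return comb(n, k)


def tau_minus(n, k):
    """Number of k-negative test inputs from the closed form of Lemma 9.6.5."""
    parts = k - 1
    w = n % parts
    big, small = -(-n // parts), n // parts      # ceiling and floor of n/(k-1)
    denom = (factorial(big) ** w * factorial(small) ** (parts - w)
             * factorial(w) * factorial(parts - w))
    return factorial(n) // denom


def count_balanced_partitions(n, k):
    """Brute-force count of (k-1)-balanced partitions of an n-element set."""
    parts = k - 1
    w = n % parts
    sizes = sorted([-(-n // parts)] * w + [n // parts] * (parts - w))
    seen = set()
    for labels in product(range(parts), repeat=n):
        blocks = [frozenset(v for v in range(n) if labels[v] == b)
                  for b in range(parts)]
        if sorted(len(b) for b in blocks) == sizes:
            seen.add(frozenset(blocks))          # unordered collection of sets
    return len(seen)


for n, k in [(4, 3), (5, 3), (6, 4), (7, 4)]:
    print(n, k, tau_plus(n, k), tau_minus(n, k), count_balanced_partitions(n, k))
```

For the parameters of Fig. 9.13 ($n = 10$, $k = 4$) the closed form gives $10!/(4!\,(3!)^2\,1!\,2!) = 2100$ negative test inputs.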

APPROXIMATOR CIRCUITS It simplifies the development of lower bounds to assume that each 
AND gate in a circuit is followed by an OR gate and vice versa and that the output gate is an 
AND gate. This requirement can be met by interposing between successive AND (OR) gates 
an OR (AND) gate both of whose inputs are connected together. Since this transformation at 
most triples the number of gates, an exponential lower bound on the size of the transformed 
circuit yields an exponential lower bound on the size of the original circuit. 

A monotone circuit for $f_{\text{clique},k}^{(n)}$ has (edge) variables drawn from the set
$\{x_{i,j} \mid 1 \le i < j \le n\}$. The approximation to an input variable $x_{i,j}$ is $x_{i,j}$ itself. Gates in a circuit are
successively replaced by approximator circuits starting with a gate that is at greatest distance from the
root (output vertex) and continuing with previously unvisited gates at greatest distance from
the root. Thus, when an AND or OR gate is replaced, its inputs have previously been replaced
by functions $f_l$ and $f_r$ that approximate the functions $g_l$ and $g_r$ computed in the original
circuit.

Approximations to AND ($\wedge$) and OR ($\vee$) gates are denoted $\tilde{\wedge}$ and $\tilde{\vee}$, respectively. As seen
below, the approximation given to a gate is context dependent. Approximations are defined
in terms of endpoint sets. Given a set of edge variables, for example $\{x_{1,2}, x_{1,3}, x_{2,3}, x_{1,4}\}$, its
associated endpoint set is the set of vertex indices used to define the edge variables, which is
$\{1, 2, 3, 4\}$ in this example. Given a term $t$ (a product (AND) or sum (OR) of edge variables),
the endpoint set associated with it, $E(t)$, is the endpoint set of the edge variables appearing in
the term. For example, if $t = x_{1,2} \wedge x_{1,3} \wedge x_{2,3} \wedge x_{1,4}$ or $t = x_{1,2} \vee x_{1,3} \vee x_{2,3} \vee x_{1,4}$, then
$E(t) = \{1, 2, 3, 4\}$. The endpoint size of a term $t$, denoted $|E(t)|$, is the number of indices
in $E(t)$.
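
The endpoint set of a term is simple to compute once a term is represented as a collection of edges. The sketch below assumes, purely for illustration, that the edge variable $x_{i,j}$ is encoded as the tuple `(i, j)`; the function name `endpoint_set` is hypothetical.

```python
def endpoint_set(term):
    """Return E(t) for a term given as an iterable of edges (i, j)."""
    return {vertex for edge in term for vertex in edge}


# The example term from the text: x_{1,2}, x_{1,3}, x_{2,3}, x_{1,4}.
term = [(1, 2), (1, 3), (2, 3), (1, 4)]
print(endpoint_set(term), len(endpoint_set(term)))
```

The same function serves for products and sums, since $E(t)$ depends only on which edge variables appear in the term.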

Consider a gate to be approximated. Let its two inputs be from gates that compute
functions $f_l$ and $f_r$. Like any function, $f_r$ and $f_l$ can be represented in either the monotone SOPE
or POSE form. (All SOPEs and POSEs in this section are monotone.) The approximation
rules for AND and OR gates are described below and denoted $\tilde{\wedge}$ and $\tilde{\vee}$, respectively. Here we
let $p = \lfloor \sqrt{k-1}/2 \rfloor$ and $q = \lfloor n/(4k) \rfloor$.

$\tilde{\wedge}$: The approximation $f_l \tilde{\wedge} f_r$ to $f_l \wedge f_r$ is obtained by representing $f_l \wedge f_r$ in the
sum-of-products expansion (SOPE) and eliminating all product terms whose endpoint set contains
more than $p$ vertices. It follows that $f_l \wedge f_r \ge f_l \tilde{\wedge} f_r$.

$\tilde{\vee}$: The approximation $f_l \tilde{\vee} f_r$ to $f_l \vee f_r$ is obtained by representing $f_l \vee f_r$ in the
product-of-sums expansion (POSE) and eliminating all sum terms whose endpoint set contains more
than $q$ vertices. It follows that $f_l \vee f_r \le f_l \tilde{\vee} f_r$.
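
The AND rule can be sketched in Python under some illustrative assumptions: a monotone SOPE is stored as a set of product terms, each a `frozenset` of edges `(i, j)`; the SOPE of the conjunction is taken to be its set of minimal products (absorption applied); and the names `approx_and` and `evaluate` are hypothetical. This is one plausible reading of the rule, not the text's own construction.

```python
from itertools import product


def endpoints(term):
    """Endpoint set E(t) of a product term given as a set of edges (i, j)."""
    return {v for edge in term for v in edge}


def approx_and(sope_l, sope_r, p):
    """Approximate AND: expand the conjunction, drop products with |E(t)| > p."""
    products = {tl | tr for tl, tr in product(sope_l, sope_r)}
    # Keep only minimal products (a product absorbed by a smaller one is redundant).
    minimal = {t for t in products if not any(s < t for s in products)}
    return {t for t in minimal if len(endpoints(t)) <= p}


def evaluate(sope, edges_present):
    """Value of a monotone SOPE on the graph whose edge set is edges_present."""
    return any(term <= edges_present for term in sope)


f_l = {frozenset({(1, 2)})}                          # x_{1,2}
f_r = {frozenset({(2, 3)}), frozenset({(3, 4)})}     # x_{2,3} OR x_{3,4}
approx = approx_and(f_l, f_r, p=3)

graph = frozenset({(1, 2), (3, 4)})
exact_value = evaluate(f_l, graph) and evaluate(f_r, graph)
print(exact_value, evaluate(approx, graph))
```

On this graph the exact AND has value 1, while the approximation has discarded the product $x_{1,2} \wedge x_{3,4}$ (four endpoints) and evaluates to 0, illustrating that the approximate AND can only err by producing 0 in place of 1.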

Since $f_l \wedge f_r \ge f_l \tilde{\wedge} f_r$ and $f_l \vee f_r \le f_l \tilde{\vee} f_r$, if a positive test input $x$ causes the output
of the approximated circuit to have value 0 when it should have value 1, then there is an
approximated AND gate (including the output gate) that has value 0 on $x$ when it should have
value 1. Similarly, if there is a negative test input $x$ that causes the approximated output to be
1 when it should be 0, there is an approximated OR gate that has value 1 on $x$ when it should
have value 0. We now examine the performance of approximator circuits.

PERFORMANCE OF APPROXIMATOR CIRCUITS We now show that when the approximation
process is complete, the approximation circuit for $f_{\text{clique},k}^{(n)}$ makes a very large number of errors
but that each gate approximation introduces a small number of errors. Thus, many gates must
have been approximated to produce the large number of errors made by the fully approximated
circuit. In fact, we show that the approximating circuit for $f_{\text{clique},k}^{(n)}$ either has output
identically 0, thereby making one error on each of the $\tau_+ = \binom{n}{k}$ positive test inputs (it produces
0 when it should produce 1), or makes at least $\tau_-/2$ errors on the $\tau_-$ negative test inputs (it produces
1 when it should produce 0). On the other hand, we also show that approximating one AND
or OR gate causes a small number of errors, at most $e_{\text{AND}}$ errors per AND gate on positive
test inputs and at most $e_{\text{OR}}$ errors per OR gate on negative test inputs, quantities for which
upper bounds are derived below. It follows that the original circuit for $f_{\text{clique},k}^{(n)}$ either has at
least $\tau_+/e_{\text{AND}}$ AND gates or at least $\tau_-/(2e_{\text{OR}})$ OR gates. Since we do not know which case
holds, the lower bound on the monotone circuit size of $f_{\text{clique},k}^{(n)}$ is the smaller of these two
lower bounds.

LEMMA 9.6.6 Let $k \le n + 1$. Then any approximation circuit for $f_{\text{clique},k}^{(n)}$ either computes a
function that is identically zero or makes errors on at least half of the $k$-negative test inputs.

Proof Let the approximation circuit for $f_{\text{clique},k}^{(n)}$ compute the function $\hat{f}_{\text{clique},k}^{(n)}$. If this
function is identically zero, we are done. Suppose not. Since the output gate in the original
circuit is an AND gate, the function $\hat{f}_{\text{clique},k}^{(n)}$ is represented by a SOPE in which each term
is the product of variables whose endpoint set (the vertices involved) has size at most $p$.
Because $\hat{f}_{\text{clique},k}^{(n)}$ is not identically zero, there is a non-zero term $t$ such that $\hat{f}_{\text{clique},k}^{(n)} \ge t$.
An error is made on a negative test input if $t = 1$. But this happens only if each of the
endpoints in $E(t)$ is in a different set of the $(k-1)$-balanced partition defining the negative
test input.

Let $\phi$ be the fraction of the negative test inputs on which $t = 1$. We derive a lower
bound to $\phi$ by deriving an upper bound on the fraction $\chi$ of the $(k-1)$-balanced partitions
with the property that two or more vertices in $E(t)$ fall into the same set. It follows that

$$\phi \ge 1 - \chi$$

To simplify bounding $\chi$, we use the one-to-one correspondence developed in the proof
of Lemma 9.6.5 between the $n$ vertices in $V = \{1, 2, 3, \ldots, n\}$ and the pairs $\mathcal{P}$ associated
with a $(k-1)$-balanced partition. Since $E(t)$ has at most $p$ vertices, the number of ways to
assign two vertices from $E(t)$ to pairs in $\mathcal{P}$ so that two of them fall into the same set, $N_2$,
is at most the number of ways to choose two vertices from a set of $p$ vertices, $p(p-1)/2$,
times the number of ways of assigning these two vertices to pairs in $\mathcal{P}$, $m_2$, and the number
of ways of assigning the remaining $n-2$ vertices, $(n-2)!$. Here $m_2$ is at most the product
of the number of ways of choosing a pair for the first vertex, $(k-1)\lceil n/(k-1) \rceil$, and the
number of ways of choosing a pair for the second from the same set, $\lceil n/(k-1) \rceil - 1$. Thus,
$N_2$ is at most $(p(p-1)/2)(k-1)\lceil n/(k-1) \rceil (\lceil n/(k-1) \rceil - 1)(n-2)!$, which is at most
$p^2 \lceil n/(k-1) \rceil (n-1)!/2$. Since there are $n!$ assignments of vertices in $V$ to pairs in $\mathcal{P}$,
$\chi \le p^2 \lceil n/(k-1) \rceil /(2n)$. Because $p = \lfloor \sqrt{k-1}/2 \rfloor$, $\chi$ is at most 1/4 since $k-1 \le n$.
Hence $\phi \ge 3/4$, which is more than half of the negative test inputs. ■

We now derive upper bounds on the number of errors introduced through the approximation
of individual AND and OR gates. Since we have assumed that AND and OR gates alternate
on any path between inputs and outputs, it follows that the inputs $f_l$ and $f_r$ to an AND gate
are outputs of OR gates (and vice versa). Furthermore, by the approximation rules, if $f_l$ and $f_r$
are inputs to an AND (OR) gate, every sum (product) in their POSE (SOPE) has an endpoint
set size of at most $q$ ($p$). We now show that each replacement of a gate by its approximator
introduces a relatively small number of errors. We begin by establishing this fact for OR gates.

LEMMA 9.6.7 Let an OR gate $\vee$ and its approximation $\tilde{\vee}$ each be given as inputs the functions
$f_l$ and $f_r$ whose SOPE contains product terms of endpoint size $p$ or less. Then the number of
$k$-negative test inputs for which $\vee$ and $\tilde{\vee}$ produce different outputs ($\vee$ has value 0 but $\tilde{\vee}$ has value
1) is at most $e_{\text{OR}}$, where $w = n \bmod (k-1)$:

$$e_{\text{OR}} = \frac{(n/2)^{q+1}\,(n-q-1)!}{\left(\lceil n/(k-1) \rceil !\right)^{w} \left(\lfloor n/(k-1) \rfloor !\right)^{k-1-w}\, w!\,(k-1-w)!}$$

Proof Let $f_{\text{correct}} = f_l \vee f_r$ and $f_{\text{approx}} = f_l \tilde{\vee} f_r$. Let $t_1, \ldots, t_l$ be the product terms
in the SOPE for $f_{\text{correct}}$. Since the endpoint size of all terms in the SOPE of $f_{\text{correct}}$ is at
most $p$, each term is the product of at most $p(p-1)/2$ variables.

Using the association between $(k-1)$-balanced partitions and pairs of indices given
in the proof of Lemma 9.6.5, we count $N$, the number of one-to-one mappings from $V$
to $\mathcal{P}$ for which $f_{\text{correct}}(x) = 0$ but $f_{\text{approx}}(x) = 1$, after which we divide by $D$, the
number of mappings corresponding to a single partition of the variables, to compute
$e_{\text{OR}} = N/D$. From the proof of Lemma 9.6.5 we have that
$D = (\lceil n/(k-1) \rceil !)^{w} (\lfloor n/(k-1) \rfloor !)^{k-1-w}\, w!\,(k-1-w)!$.

To derive an upper bound to $N$, observe that $f_{\text{approx}}(x)$ is obtained by converting the
SOPE of $f_{\text{correct}}$ to a POSE and deleting all sums in this POSE whose endpoint set size
exceeds $q$. Thus, $N$ is at most the number of ways to assign vertices to pairs in $\mathcal{P}$ that
causes a deleted sum to be 0 because the new POSE may now become 1. But this can
happen only if the endpoint set size of the deleted sum is at least $q + 1$. Thus, only if at
least $q + 1$ vertices in a sum are assigned values is it possible to have $f_{\text{correct}}(x) = 0$ and
$f_{\text{approx}}(x) = 1$.

Below we show that each vertex can be assigned at most $n/2$ different pairs in $\mathcal{P}$. It
follows there are at most $(n/2)^{q+1} (n-q-1)!$ ways to assign pairs to $q + 1$ or more
vertices because the first $q + 1$ can be assigned in at most $(n/2)^{q+1}$ ways and the remaining
$(n-q-1)$ vertices can be assigned in at most $(n-q-1)!$ ways. This is the desired upper
bound on $N$.

We now show that every mapping from $V$ to $\mathcal{P}$ that corresponds to a negative test input
$x$ assigns each vertex to at most $n/2$ pairs in $\mathcal{P}$.

Let $t_1, \ldots, t_l$ be product terms in the SOPE of $f_{\text{correct}}$. We examine these terms in
sequence. Consider a partial mapping from $V$ to $\mathcal{P}$ that assigns values to variables so that
at least one variable in each of the products $t_1, \ldots, t_{i-1}$ is 0, thereby ensuring that each
product is 0. Consider now the $i$th product, $t_i$. If the partial mapping assigns value 0 to at
least one of its variables, we move on to consider $t_{i+1}$. (It cannot set all variables in $t_i$ to 1
because we are considering mappings causing all terms to be 0.)

Suppose that the partial mapping has not assigned value 0 to any of the variables of $t_i$.
There are two cases to consider. For some variable $x_{a,b}$ of $t_i$, either a) one or b) both of the
vertices $v_a, v_b \in V$ has not been assigned a pair in $\mathcal{P}$. In the first case, assign the second
vertex to the set containing the first, thereby setting $x_{a,b} = 0$. This can be done in at most
$\lceil n/(k-1) \rceil - 1 \le n/(k-1)$ ways since the set contains at most $\lceil n/(k-1) \rceil$ elements and at
least one of them has been chosen previously, namely the first vertex. In the second case the
two vertices can be assigned to at most $(k-1)\lceil n/(k-1) \rceil (\lceil n/(k-1) \rceil - 1) \le 2n^2/(k-1)$
pairs because the first can be assigned to $(k-1)$ sets each containing at most $\lceil n/(k-1) \rceil$
elements and the second must be assigned to one of the remaining elements in that set.

The number of ways to choose variables in $t_i$ so that it has value 0 is the number of
ways to choose a variable of each kind multiplied by the number of ways to assign values to
it. Let $\alpha$ be the number of variables of $t_i$ for which one vertex has previously been assigned
a pair and let $\beta$ be the number of variables for which neither vertex has been assigned a
pair. ($\beta \le p(p-1)/2 - \alpha$ since $t_i$ has at most $p(p-1)/2$ variables.) Thus, a variable
of the first kind can be assigned in at most $\alpha n/(k-1)$ ways and the number of ways of
assigning the two vertices in variables of the second kind is at most $\beta\, 2n^2/(k-1)$. Since
each vertex associated in such pairs can be assigned in the same number of ways, $\gamma$, it follows
that $\gamma^2 \le \beta\, 2n^2/(k-1)$. Thus, $\gamma \le \sqrt{\beta\, 2n^2/(k-1)}$.

Summarizing, the variables in $t_i$ can be assigned in at most the following number of
ways so that $t_i$ has value 0:

$$\alpha n/(k-1) + \sqrt{(p(p-1)/2 - \alpha)\, 2n^2/(k-1)}$$

This quantity is largest when $\alpha = 0$ and is at most $n/2$ since $p = \lfloor \sqrt{k-1}/2 \rfloor$, which is
the desired conclusion. ■

We now derive an upper bound on the number of errors that can be made by AND gates
on $k$-positive inputs.

LEMMA 9.6.8 Let an AND gate $\wedge$ and its approximation $\tilde{\wedge}$ each be given as inputs the functions
$f_l$ and $f_r$ whose POSE contains sum terms of endpoint size $q$ or less. Then the number of $k$-positive
test inputs for which $\wedge$ and $\tilde{\wedge}$ produce different outputs ($\wedge$ has value 1 but $\tilde{\wedge}$ has value 0) is at
most $e_{\text{AND}}$:

$$e_{\text{AND}} = \frac{(n/2)^{p+1}\,(n-p-1)!}{k!\,(n-k)!}$$

Proof The proof is similar to that of Lemma 9.6.7. Let $f_{\text{correct}} = f_l \wedge f_r$ and
$f_{\text{approx}} = f_l \tilde{\wedge} f_r$. Let $c_1, \ldots, c_l$ be the sum terms in the POSE for $f_{\text{correct}}$. Since by induction the
endpoint size of all terms in the POSE of $f_l$ and $f_r$ is at most $q$, each term in $f_{\text{correct}}$ is the
sum of at most $q(q-1)/2$ variables.

In this case we count the number of $k$-positive test graphs (they contain one $k$-clique)
that cause $f_{\text{correct}}(x) = 1$ but $f_{\text{approx}}(x) = 0$. Since a $k$-positive test graph contains just
those edges between a specified set of $k$ vertices, we define each such graph by a one-to-one
mapping from the vertices (endpoints) in $V$ to the integers $\mathbb{N}(n) = \{1, 2, \ldots, n\}$, where
we adopt the rule that vertices mapped to the first $k$ integers are those in the clique associated
with a particular test graph. It follows that each $k$-positive test graph corresponds to exactly
$k!\,(n-k)!$ of these 1-1 mappings. Then, $e_{\text{AND}}$ is the number of such 1-1 mappings for
which $f_{\text{correct}}(x) = 1$ but $f_{\text{approx}}(x) = 0$ divided by $k!\,(n-k)!$.

We show that any mapping that results in $f_{\text{correct}}(x) = 1$ assigns each endpoint to at
most $n/2$ values from $\mathbb{N}(n)$. But $f_{\text{approx}}(x) = 0$ for positive test inputs only if more than
$p$ endpoints are assigned values, because $f_{\text{approx}}$ is obtained from $f_{\text{correct}}$ by discarding
product terms in its SOPE that contain more than $p$ endpoints. It follows that at most
$(n/2)^{p+1} (n-p-1)!$ of the 1-1 mappings result in an error by the approximate AND
gate. Dividing by $k!\,(n-k)!$, we have the desired upper bound on $e_{\text{AND}}$.

To complete the proof we must show that each endpoint is assigned at most $n/2$ values
from $\mathbb{N}(n)$. Consider the sum terms $c_1, \ldots, c_l$ in the POSE of $f_{\text{correct}}$ in sequence and
consider a partial mapping from $V$ to $\mathbb{N}(n)$ that causes at least one variable in each of the
sums $c_1, \ldots, c_{i-1}$ to be 1, thereby ensuring that the value of each sum is 1. Now consider
the $i$th sum, $c_i$. If the partial mapping assigns value 1 to at least one variable, we move on
to $c_{i+1}$. (It cannot set all variables in $c_i$ to 0 because we are considering mappings causing
all terms to be 1.)

We now extend the mapping by considering the set $C_i$ of variables of $c_i$ that have not
been assigned a value. A given variable $x_{a,b}$ in $C_i$ has either one or no endpoints (vertices)
previously mapped to an integer in $\mathbb{N}(n)$. If one endpoint, say $a$, has been assigned an
integer, the other endpoint, $b$, can be assigned to at most one of $k-2$ integers that cause
$x_{a,b} = 1$ because endpoint $a$ was previously assigned a value in the range $\{1, 2, \ldots, k\}$
together with at least one other vertex and $b$ must be different from them. Because there are
at most $q = \lfloor n/(4k) \rfloor$ variables of the first type, there are at most $q(k-2)$ ways to assign the
one endpoint of a variable $x_{a,b}$ of the first type so that $x_{a,b} = 1$.

Consider now variables of the second type. There are at most $q(q-1)/2$ such variables
and at most $(q(q-1)/2)\,k(k-1)$ ways to make assignments to both endpoints so that
a variable has value 1. This follows because each endpoint is assigned to a distinct integer
among the first $k$ integers in $\mathbb{N}(n)$. Since each endpoint can be assigned in the same number
of ways, this number is at most $\sqrt{(q(q-1)/2)\,k(k-1)}$.

It follows that the number of ways to assign an endpoint so that the correct and approximate
functions differ is at most $q(k-2) + \sqrt{(q(q-1)/2)\,k(k-1)} \le 2qk$, which is no
more than $n/2$ since $q = \lfloor n/(4k) \rfloor$. This is the desired conclusion. ■

The desired result follows from the above lemmas. 



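THEOREM 9.6.6 For $n \ge 13$ and $8 \le k \le n/2$, every monotone circuit for the clique function
$f_{\text{clique},k}^{(n)} : \mathcal{B}^{n(n-1)/2} \mapsto \mathcal{B}$ has a circuit size satisfying the following lower bound:

$$C_{\Omega_{\text{mon}}}\left(f_{\text{clique},k}^{(n)}\right) \ge \frac{1}{2}(1.8)^{\min\left(\sqrt{k-1}/2,\ n/(2k)\right)}$$

The largest value for this lower bound is $C_{\Omega_{\text{mon}}}\left(f_{\text{clique},k}^{(n)}\right) = 2^{\Omega(n^{1/3})}$.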

Proof From the discussion at the beginning of this section, we see that the monotone circuit
size of $f_{\text{clique},k}^{(n)}$ is at least $\min(\tau_+/e_{\text{AND}},\ \tau_-/(2e_{\text{OR}}))$. Thus,

$$C_{\Omega_{\text{mon}}}\left(f_{\text{clique},k}^{(n)}\right) \ge \min\left(\frac{n!}{(n/2)^{p+1}(n-p-1)!},\ \frac{n!}{2(n/2)^{q+1}(n-q-1)!}\right) \ge \min\left(\frac{(n-p)^{p+1}}{(n/2)^{p+1}},\ \frac{(n-q)^{q+1}}{2(n/2)^{q+1}}\right)$$

Let $8 \le k \le n/2$. It follows that $p = \lfloor \sqrt{k-1}/2 \rfloor \le \sqrt{n}/(2\sqrt{2})$ and $q = \lfloor n/(2k) \rfloor \le n/16$.
Thus, $p, q \le n/10$ if $n \ge 13$. Hence both $(n-p)$ and $(n-q)$ are at least $9n/10$, and

$$C_{\Omega_{\text{mon}}}\left(f_{\text{clique},k}^{(n)}\right) \ge \min\left((1.8)^{p+1},\ \frac{1}{2}(1.8)^{q+1}\right)$$

The desired conclusion follows from this and the observation that $p + 1 \ge \sqrt{k-1}/2$ and
$q + 1 \ge n/(2k)$. That the maximum value of $\min(\sqrt{k-1}/2,\ n/(2k))$ is $\Omega(n^{1/3})$ under
variation of $k$ is left as a problem. (See Problem 9.38.) ■
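
The growth of the exponent $\min(\sqrt{k-1}/2,\ n/(2k))$ can be explored numerically. The sketch below is only an illustration (the function names are hypothetical): it searches the range $8 \le k \le n/2$ allowed by Theorem 9.6.6 for the $k$ that maximizes the exponent and compares the maximum with $n^{1/3}$.

```python
from math import sqrt


def exponent(n, k):
    """Exponent of 1.8 in the lower bound of Theorem 9.6.6."""
    return min(sqrt(k - 1) / 2, n / (2 * k))


def best_k(n):
    """Value of k in [8, n/2] that maximizes the exponent (exhaustive search)."""
    return max(range(8, n // 2 + 1), key=lambda k: exponent(n, k))


for n in (1000, 8000, 64000):
    k = best_k(n)
    ratio = exponent(n, k) / n ** (1 / 3)
    print(n, k, round(exponent(n, k), 3), round(ratio, 3))
```

The two arguments of the minimum balance near $k = n^{2/3}$, where both are close to $n^{1/3}/2$; this is consistent with the $2^{\Omega(n^{1/3})}$ bound stated in the theorem.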

9.6.4 Slice Functions 

Although, as shown above, some monotone functions have exponential circuit size over the 
monotone basis, it is doubtful that the methods of analysis used to obtain this result can be 
extended to derive such bounds over the standard basis. (See the Chapter Notes.) 

This section introduces a note of optimism by showing that the monotone circuit size of
monotone slice functions can provide a strong lower bound on the circuit size of such functions
over the standard basis. In addition, there are NP-complete languages whose characteristic
functions are slice functions. Thus, if such functions can be shown to have super-polynomial
monotone circuit size, $P \ne NP$.

Let $|x|$ denote the number of 1's in $x$. We now define the slice functions.

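DEFINITION 9.6.6 A function $s : \mathcal{B}^n \mapsto \mathcal{B}$ is a slice function if there is an integer $0 \le k \le n$
such that $s(x) = 0$ if $|x| < k$ and $s(x) = 1$ if $|x| > k$. The $k$th slice of a function
$f : \mathcal{B}^n \mapsto \mathcal{B}$, $0 \le k \le n$, is the function $f^{[k]} : \mathcal{B}^n \mapsto \mathcal{B}$ defined below.

$$f^{[k]}(x) = \begin{cases} 0 & |x| < k \\ f(x) & |x| = k \\ 1 & |x| > k \end{cases}$$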



It should be clear from this definition that slice functions are monotone. Below we show
that if a Boolean function $f$ on $n$ variables has a large circuit size, then one of its slices has a
circuit size that differs from the size of $f$ by at most a multiplicative factor that is linear in $n$.
Thus, a function $f$ has a large circuit size if and only if one of its slice functions has a large
circuit size.
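
The definition of a slice and the claim that slices are monotone can be checked by brute force for small $n$. In this sketch Boolean inputs are tuples of 0s and 1s, and the names `kth_slice` and `is_monotone` are hypothetical helpers introduced only for illustration.

```python
from itertools import product


def kth_slice(f, k):
    """Return the kth slice of f: 0 below weight k, f at weight k, 1 above."""
    def sliced(x):
        weight = sum(x)
        if weight < k:
            return 0
        if weight > k:
            return 1
        return f(x)
    return sliced


def is_monotone(g, n):
    """Check that raising any single input from 0 to 1 never lowers g."""
    for x in product((0, 1), repeat=n):
        for i in range(n):
            if x[i] == 0:
                y = x[:i] + (1,) + x[i + 1:]
                if g(x) > g(y):
                    return False
    return True


def parity(x):
    """A non-monotone example function."""
    return sum(x) % 2


n = 4
print(is_monotone(parity, n))
print([is_monotone(kth_slice(parity, k), n) for k in range(n + 1)])
```

Parity itself is not monotone, yet every one of its slices is, because raising an input either keeps the weight below $k$, moves it to exactly $k$, or moves it above $k$, and the slice value cannot decrease in any of these cases.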

We set the stage with a lemma that shows that the circuit size of a Boolean function is 
bounded above by the circuit size of its slices plus an additive term linear in its number of 
variables. 

LEMMA 9.6.9 Let $\Omega_0$ be the standard basis and $f : \mathcal{B}^n \mapsto \mathcal{B}$. Then the following holds, where
$C_{\Omega_0}(f^{[0]}, f^{[1]}, \ldots, f^{[n]})$ is the circuit size of all the slices simultaneously:

$$C_{\Omega_0}(f) \le C_{\Omega_0}(f^{[0]}, f^{[1]}, \ldots, f^{[n]}) + O(n)$$

Proof The goal is to construct a circuit for $f$ given the input tuple $x$ and a circuit for
all the functions $f^{[0]}, f^{[1]}, \ldots, f^{[n]}$. This is easily done. We construct a circuit to count
the number of 1's among the $n$ inputs and represent the result in binary. We then supply
this number as an address to a direct storage address function (multiplexer) where the other
inputs are the values of the slice functions. If the address is $|x|$, the output of the multiplexer
is $f^{[|x|]}(x) = f(x)$. Since, as shown in Lemma 2.11.1, the counting circuit can be realized with a circuit
of size linear in $n$, and, as shown in Lemma 2.5.5, the multiplexer in question can be realized
with a linear-size circuit, the result follows. ■
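
The construction in this proof, a counter feeding a multiplexer that selects among the slice outputs, can be mimicked in software. The following sketch uses hypothetical names and plain Python functions in place of circuits, so it illustrates only the logic of the argument, not the gate counts.

```python
from itertools import product


def slices_of(f, n):
    """Return the list [f^[0], f^[1], ..., f^[n]] of slices of f."""
    def make(k):
        def sliced(x):
            weight = sum(x)
            return 0 if weight < k else 1 if weight > k else f(x)
        return sliced
    return [make(k) for k in range(n + 1)]


def from_slices(slices):
    """Rebuild f: count the 1s in x and use the count to select a slice output."""
    def rebuilt(x):
        values = [s(x) for s in slices]   # outputs of all the slice circuits
        address = sum(x)                  # the counting circuit computes |x|
        return values[address]            # the multiplexer selects f^[|x|](x)
    return rebuilt


def f(x):
    """An arbitrary, non-monotone test function on 4 variables."""
    return (x[0] ^ x[1]) & (1 - x[3]) | (x[2] & x[3])


n = 4
g = from_slices(slices_of(f, n))
print(all(f(x) == g(x) for x in product((0, 1), repeat=n)))
```

The reconstruction works because the slice selected by the address $|x|$ is evaluated at exactly the weight where it agrees with $f$.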

We now establish the connection between the circuit size of a function and that of one of
its slices.

THEOREM 9.6.7 Let $\Omega_0$ be the standard basis and $f : \mathcal{B}^n \mapsto \mathcal{B}$. Then there exists $0 \le k \le n$
such that

$$\frac{C_{\Omega_0}(f)}{n+1} - O(1) \le C_{\Omega_0}(f^{[k]}) \le C_{\Omega_0}(f) + O(n)$$

Proof The first inequality follows from Lemma 9.6.9, the following inequality and the
observation that at least one term in an average is greater than or equal to the average.

$$C_{\Omega_0}(f^{[0]}, f^{[1]}, \ldots, f^{[n]}) \le \sum_{i=0}^{n} C_{\Omega_0}(f^{[i]})$$

The second inequality uses the fact that the $k$th slice of a function can be expressed as

$$f^{[k]}(x) = \left(f(x) \wedge \tau_k^{(n)}(x)\right) \vee \tau_{k+1}^{(n)}(x)$$

where $\tau_j^{(n)}(x)$ is the threshold function that has value 1 exactly when $|x| \ge j$.
Since $\tau_j^{(n)}(x)$ can be realized by a circuit of size linear in $n$ (see Theorem 2.11.1), the second
inequality follows. ■

In Theorem 9.6.9 we show that the monotone circuit size of slice functions provides a
lower bound on their non-monotone circuit size up to a polynomial additive term. Before
establishing this result we introduce the concept of pseudo-negation. A pseudo-negation for
variable $x_i$ in a monotone Boolean function $f : \mathcal{B}^n \mapsto \mathcal{B}$ is a function $h_i$ such that replacing
each instance of $\bar{x}_i$ in a circuit for $f$ by $h_i$ does not change the value computed by the circuit.
Thus, the pseudo-negation $h_i$ acts like the real negation $\bar{x}_i$.

In Theorem 9.6.9 we also show that for $1 \le i \le n$ the punctured threshold function
$\tau_{k,\neg i}^{(n)} : \mathcal{B}^n \mapsto \mathcal{B}$, which depends on all the variables except $x_i$, is a pseudo-negation for a $k$th
slice of every monotone function. Since for a given $k$ each of these threshold functions can be
realized by a monotone circuit of size $O(n \log n)$ (see Theorem 6.8.2), they can all be realized
by a monotone circuit of size $O(n^2 \log n)$. Although this result can be used in Theorem 9.6.9,
the following stronger result is used instead.
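
The behavior claimed for the punctured threshold function can be verified exhaustively for small $n$. The sketch below assumes, following the proof of Theorem 9.6.9, that the punctured threshold has value 1 exactly when at least $k$ of the variables other than $x_i$ are 1; indices are 0-based and the function name is hypothetical.

```python
from itertools import product


def punctured_threshold(x, k, i):
    """1 if at least k of the inputs other than x[i] are 1, else 0."""
    return int(sum(x) - x[i] >= k)


n, k = 5, 2
agrees_with_negation = all(
    punctured_threshold(x, k, i) == 1 - x[i]
    for x in product((0, 1), repeat=n) if sum(x) == k
    for i in range(n)
)
print(agrees_with_negation)
```

On inputs of weight exactly $k$ the function equals the negation of $x_i$; on inputs of weight above $k$ it is 1 and on inputs of weight below $k$ it is 0, which is where the slice function itself is already constant.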

We now describe a circuit that computes all of the above pseudo-negations efficiently. This
circuit uses the complementary number system, a system that associates with each integer $i$
in the set $\mathbb{N}(n) = \{0, 1, 2, \ldots, n-1\}$ the complementary set $\mathbb{N}(n) - \{i\}$. It makes use of
results on sorting networks found in Chapter 6.

THEOREM 9.6.8 The set $\{\tau_{k,\neg i}^{(n)} \mid 1 \le i \le n\}$ of pseudo-negations can be realized by a monotone
circuit of size $O(n \log^2 n)$.

Proof We assume that $n = 2^s$. If not, add variables with value 0 to increase the number to
the next power of 2. This does not change the value of the function on the first $n$ variables.
For this proof let the pseudo-negations $\tau_{k,\neg i}^{(n)}$ be defined for $0 \le i \le n-1$ and on the
variables whose indices are in $\mathbb{N}(n)$. (We subtract 1 from each index.) Let $D_i = \mathbb{N}(n) - \{i\}$
denote the indices of the variables on which $\tau_{k,\neg i}^{(n)}$ depends. An efficient monotone
circuit to compute all the pseudo-negations $\{\tau_{k,\neg i}^{(n)} \mid i \in \mathbb{N}(n)\}$ is based on an efficient
decomposition of the sets $\{D_i \mid i \in \mathbb{N}(n)\}$.

For $a, b \ge 0$, let $U_{a,b}$ be defined by

$$U_{a,b} = \{a2^b + c \mid 0 \le c \le 2^b - 1\}$$

For example, $U_{3,3} = \{24, 25, 26, 27, 28, 29, 30, 31\}$, $U_{1,2} = \{4, 5, 6, 7\}$, and $U_{2,1} = \{4, 5\}$.
The set $U_{a,b}$ has size $2^b$.

For $n = 2^s$, every set $D_i = \mathbb{N}(n) - \{i\}$ can be represented as the disjoint union of the
sets $U_{a,b}$ below, where $0 \le a_{i,j} \le 2^{s-j} - 1$. (This is the complementary number system;
see Fig. 9.14.)

$$D_i = U_{a_{i,s-1},s-1} \cup U_{a_{i,s-2},s-2} \cup \cdots \cup U_{a_{i,0},0}$$

To see this, note that if $i$ is in the first (second) half of $\mathbb{N}(n)$, $U_{a_{i,s-1},s-1}$ denotes the second
(first) half; that is, $a_{i,s-1} = 1$ ($a_{i,s-1} = 0$). The next set, $U_{a_{i,s-2},s-2}$, is the half of the
remaining set $D_i - U_{a_{i,s-1},s-1}$ that does not contain $i$, etc. Thus, $D_i$ is decomposed as
the disjoint union of sets of size $2^{s-1}, 2^{s-2}, \ldots, 2^0$. For example, when $n = 16$,
$D_3 = U_{1,3} \cup U_{1,2} \cup U_{0,1} \cup U_{2,0}$. Figure 9.14 shows the values of $a_{i,s-1}, a_{i,s-2}, \ldots, a_{i,0}$ for each
$i \in \mathbb{N}(n)$ for $n = 8$.
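
The decomposition can be computed directly from the binary representation of $i$. The sketch below rests on the observation (an assumption made for this illustration, though it follows from the halving argument above) that $a_{i,j}$ is obtained by shifting $i$ right by $j$ bits and flipping the lowest bit, which selects the sibling of the size-$2^j$ block containing $i$. The function names are hypothetical.

```python
def U(a, b):
    """The set U_{a,b} = {a*2^b + c | 0 <= c <= 2^b - 1}."""
    return set(range(a * 2 ** b, (a + 1) * 2 ** b))


def decompose(i, s):
    """Return [(a_{i,s-1}, s-1), ..., (a_{i,0}, 0)] for D_i with n = 2^s."""
    return [((i >> j) ^ 1, j) for j in range(s - 1, -1, -1)]


s = 4                                   # n = 16
print(decompose(3, s))                  # coefficients for D_3

n = 2 ** s
for i in range(n):
    pieces = [U(a, b) for a, b in decompose(i, s)]
    union = set().union(*pieces)
    assert union == set(range(n)) - {i}             # the pieces cover D_i
    assert sum(len(p) for p in pieces) == n - 1     # and they are disjoint
```

For $i = 3$ and $n = 16$ this reproduces the decomposition $D_3 = U_{1,3} \cup U_{1,2} \cup U_{0,1} \cup U_{2,0}$ given in the text.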

As suggested in Fig. 9.14, the sets $\{D_i \mid i \in \mathbb{N}(n)\}$ have either $U_{0,s-1}$ or $U_{1,s-1}$ in
common. Similarly, they also have either $U_{1,s-1} \cup U_{1,s-2}$, $U_{1,s-1} \cup U_{0,s-2}$, $U_{0,s-1} \cup U_{3,s-2}$,
or $U_{0,s-1} \cup U_{2,s-2}$ in common. Continuing in this fashion, we construct the sets
$\{D_i \mid i \in \mathbb{N}(n)\}$ by successively forming the disjoint union of $2^j$ sets, $1 \le j \le s$. Assembling the
sets in this fashion is much more economical than assembling them individually.

The value of $\tau_{k,\neg i}^{(n)}$, $i \in \mathbb{N}(n)$, is the $k$th largest variable whose index is in $D_i$. From now
on we equate the variables with their indices. Sorting the sets into which $D_i$ is decomposed
simplifies the computation. But these sets are exactly the sets that are sorted by Batcher's
sorting network based on Batcher's merging algorithm. (See Theorem 6.8.3.) Since on
Boolean data a comparator consists of one OR for the max operation and one AND for the
min operation, a monotone circuit of size $O(n \log^2 n)$ exists to sort the sets
$\{U_{i,j} \mid 0 \le i \le 2^{s-j} - 1,\ 0 \le j \le s-1\}$.

The functions $\tau_{k,\neg i}^{(n)}$, $0 \le i \le n-1$, can be obtained by sorting the sets
$\{U_{i,j} \mid 0 \le i \le 2^{s-j} - 1,\ 0 \le j \le s-1\}$, merging them in groups to form $D_i$ for $i \in \mathbb{N}(n)$, as suggested
above, and then taking the $k$th largest element. A faster way merges the sorted versions of
the sets $U_{a_{i,s-1},s-1}, U_{a_{i,s-2},s-2}, \ldots, U_{a_{i,0},0}$ in the order in which $D_i$ is assembled above.
For each of these sets the sorting network presents its elements in sorted order.

Figure 9.14 The coefficients $a_{i,j}$ of $D_i = \mathbb{N}(n) - \{i\}$ in the expansion
$D_i = U_{a_{i,s-1},s-1} \cup U_{a_{i,s-2},s-2} \cup \cdots \cup U_{a_{i,0},0}$ for $n = 2^s = 8$ and $s = 3$.

Since only the $k$th element of $D_i$ is needed, it is not necessary to merge all the elements
in each set when two sets are merged. To see which elements need to be merged, let
$A_i(j) = U_{a_{i,s-1},s-1} \cup U_{a_{i,s-2},s-2} \cup \cdots \cup U_{a_{i,j},j}$. Then $D_i - A_i(j)$ is a set of size $2^j - 1$. Observe
that the $k$th element of $D_i$ can be obtained by merging elements of rank $k$ and $k-1$ of
$A_i(1)$ with the element of $U_{a_{i,0},0}$. (They all have value 0 or 1.) The middle element is the
$k$th element in $D_i$. To obtain elements of rank $k$ and $k-1$ of $A_i(1)$, the elements of rank
$k$, $k-1$, $k-2$ and $k-3$ of $A_i(2)$ are merged with the two elements of $U_{a_{i,1},1}$ and the
middle two taken. In general, to obtain the elements of rank $k, \ldots, k - 2^j + 1$ of $A_i(j)$,
the elements of rank $k, \ldots, k - 2^{j+1} + 1$ of $A_i(j+1)$ are merged with the $2^j$ elements of
$U_{a_{i,j},j}$ and the middle $2^j$ taken.

We now count the number of extra AND and OR gates needed to perform the merges.
There are $2^{s-j}$ sets $A_i(j)$. The $2^j$ elements needed from these sets are obtained by
merging $2^{j+1}$ elements of $A_i(j + 1)$ with the $2^j$ elements of $U_{a_{i,j},j}$. Since
these sets can be merged in a comparator network with $O(j 2^j)$ comparators (see Theorem 6.8.2),
it follows that all the sets $A_i(j)$, $0 \le i \le n - 1$, can be formed with $O(jn)$ gates for
$0 \le j \le s - 1$. Summing over $j$, $0 \le j \le (\log_2 n) - 1$, shows that a total of
$O(n \log^2 n)$ extra gates suffice. Since $O(n \log^2 n)$ gates are used to sort the sets
$\{U_{i,j} \mid 0 \le i \le 2^{s-j} - 1,\ 0 \le j \le s - 1\}$, the desired conclusion follows. ■
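The summation step can be written out explicitly. The following worked equation assumes only the per-level bound of $O(jn)$ gates stated above and $s = \log_2 n$:

```latex
\sum_{j=0}^{s-1} O(jn)
  \;=\; O\!\Bigl(n \sum_{j=0}^{s-1} j\Bigr)
  \;=\; O\!\Bigl(n\,\frac{s(s-1)}{2}\Bigr)
  \;=\; O(n \log^2 n), \qquad s = \log_2 n .
```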

We can now show that a large lower bound on the monotone circuit size of a slice function
implies a large lower bound on its non-monotone circuit size. The importance of this statement
is emphasized by the existence of NP-complete slice functions. If such a slice function can be
shown to have super-polynomial monotone circuit size, then P $\neq$ NP.

THEOREM 9.6.9 Let $f : \mathcal{B}^n \mapsto \mathcal{B}$ be a slice function. Then

$$C_{\Omega_0}(f) \le C_{\Omega_{\mathrm{mon}}}(f) \le 2 \cdot C_{\Omega_0}(f) + O(n \log^2 n)$$

Proof The first inequality holds because the standard basis $\Omega_0$ contains the monotone basis.
To establish the second inequality, we convert a circuit over $\Omega_0$ by moving all negations to
the input variables. This can be done by at most doubling the number of gates. (See
Problems 9.11 and 2.12.)

We now show that for slice functions the negation of an input variable $x_i$ can be replaced
by the pseudo-negation function $\tau^{(n)}_{k,\neg i}$. To see this, observe that when
$|x| > k$, at least $|x| - 1 \ge k$ of the variables of $\tau^{(n)}_{k,\neg i}$ are 1 and
$\tau^{(n)}_{k,\neg i}$ has value 1. On the other hand, when $|x| < k$, then not enough variables
can be 1 for $\tau^{(n)}_{k,\neg i}$ to have value 1. Finally, when $|x| = k$,
$\tau^{(n)}_{k,\neg i} = 0$ if $x_i = 1$ because not enough of the remaining variables are 1, and
$\tau^{(n)}_{k,\neg i} = 1$ when $x_i = 0$ by a similar reasoning. Now replace $\bar{x}_i$ with
$\tau^{(n)}_{k,\neg i}$. Since $f$ is a $k$-slice, $f = 0$ when $|x| < k$, as is
$\tau^{(n)}_{k,\neg i}$. If $\bar{x}_i = 1$ when $|x| < k$, replacing $\bar{x}_i$ by its
pseudo-negation means replacing $\bar{x}_i$ by 0, which can only decrease the circuit output since
it is monotone. Thus, $f$ is computed correctly in this case. The same is true if $|x| > k$,
again by monotonicity. Since $\tau^{(n)}_{k,\neg i} = \bar{x}_i$ when $|x| = k$, the circuit
correctly computes $f$ for all inputs when $\bar{x}_i$ is replaced by the $i$th pseudo-negation. ■
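The three cases in this argument are easy to confirm by brute force. The sketch below assumes the pseudo-negation of $x_i$ is the threshold-$k$ function applied to all inputs other than $x_i$, as described in the proof; the function names are illustrative.

```python
from itertools import product

def pseudo_negation(x, i, k):
    """Threshold-k function of all inputs except x[i]."""
    return int(sum(x) - x[i] >= k)

def check(n, k):
    """Verify the three weight cases from the proof for every input of length n."""
    for x in product((0, 1), repeat=n):
        w = sum(x)
        for i in range(n):
            t = pseudo_negation(x, i, k)
            if w == k and t != 1 - x[i]:   # agrees with the true negation on weight k
                return False
            if w > k and t != 1:           # constant 1 above the slice
                return False
            if w < k and t != 0:           # constant 0 below the slice
                return False
    return True
```

On inputs of weight exactly `k` the pseudo-negation equals the true negation, which is the only place where a slice function depends on it.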

AN NP-COMPLETE SLICE FUNCTION We now exhibit the language HALF-CLIQUE CENTRAL
SLICE and show it is NP-complete. The characteristic functions of this language are slice
functions. It follows from Theorem 9.6.9 that if these slice functions have exponential circuit
size, then P $\neq$ NP. We show that HALF-CLIQUE CENTRAL SLICE is NP-complete by reducing
HALF-CLIQUE (see Problem 8.25) to it.

DEFINITION 9.6.7 A central slice of a function $f : \mathcal{B}^n \mapsto \mathcal{B}$ on $n$
variables, $f^{[\lceil n/2 \rceil]}$, is the $\lceil n/2 \rceil$th slice.

A central slice of a function $f$ on $n$ variables is the function that has value 0 if the weight
of the input tuple is less than $\lceil n/2 \rceil$, value 1 if the weight exceeds this value, and
is equal to the value of $f$ otherwise.

Given the function $f : \mathcal{B}^* \mapsto \mathcal{B}$, $f^{(n)}$ denotes the function
restricted to strings of length $n$. The family of central slice functions
$\{(f^{(n)})^{[\lceil n/2 \rceil]} \mid n \ge 2\}$ identifies the language

$$L_{\mathrm{central}}(f) = \{x \in \mathcal{B}^n \mid (f^{(n)})^{[\lceil n/2 \rceil]}(x) = 1,\ n \ge 2\}.$$

The central clique function $f^{(n)}_{\mathrm{clique},\lceil n/2 \rceil}$ has value 1 if the input
graph contains a clique on $\lceil n/2 \rceil$ vertices. The central slice of the central clique
function $f^{(n)}_{\mathrm{clique},\lceil n/2 \rceil}$ is called the half-clique central slice
function and denoted $f^{(n)}_{\mathrm{clique\ slice}}$. It has value 1 if the input graph
either contains a clique on $\lceil n/2 \rceil$ vertices or contains more edges than are in a
clique of this size.

The language HALF-CLIQUE is defined in Problem 8.25 as strings describing a graph and 
an integer k such that a graph on n vertices contains an n/2-clique or has more than k edges. 
The language HALF-CLIQUE CENTRAL SLICE associated with the central slice of a central 
clique function is defined below. It simplifies the following discussion to define $e(k)$ as the
number of edges between a set of $k$ vertices. Clearly, $e(k) = \binom{k}{2}$.

HALF-CLIQUE CENTRAL SLICE 

Instance: The description of an undirected graph G = (V, E) in which |V| is even. 

Answer: "Yes" if G contains a clique on |V|/2 vertices or at least e(|V|/2)/2 edges. 

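THEOREM 9.6.10 The language HALF-CLIQUE CENTRAL SLICE is NP-complete. Furthermore,
for all $2 \le k \le n$

$$C_{\Omega_0}\!\left(\left(f^{(n)}_{\mathrm{clique},\lceil n/2 \rceil}\right)^{[k]}\right) \le C_{\Omega_{\mathrm{mon}}}\!\left(f^{(5n)}_{\mathrm{clique\ slice}}\right)$$

For $k < e(n/2)$, $\left(f^{(n)}_{\mathrm{clique},\lceil n/2 \rceil}\right)^{[k]} = \tau^{(e(n))}_{k+1}$.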

Proof We show that HALF-CLIQUE CENTRAL SLICE is NP-complete by reducing HALF-CLIQUE
to it. Given a graph $G = (V, E)$ in HALF-CLIQUE that has $n$ vertices, $n$ even, we
construct a graph $G' = (V', E')$ on $5n$ vertices such that $G$ either contains an $n/2$-clique
or has more than $k$ edges if and only if $G'$ contains a (central) clique on $5n/2$ vertices or
has at least $\lceil e(5n/2)/2 \rceil$ edges. The construction, which can be done in polynomial
time, transforms a graph on $n$ vertices to one on $5n$ vertices such that the former is an
instance of HALF-CLIQUE if and only if the latter is an instance of HALF-CLIQUE CENTRAL SLICE.

Let $V = \{v_1, v_2, \ldots, v_n\}$. Construct $G'$ from $G$ by adding the $4n$ vertices
$R = \{r_1, r_2, \ldots, r_{2n}\}$ and $S = \{s_1, s_2, \ldots, s_{2n}\}$. Represent edges in $E'$
of $G'$ with the edge variables $\{y_{i,j} \mid 1 \le i < j \le 5n\}$. Each edge between vertices
of $G$ is an edge between vertices $V$ of $G'$. Let every edge between vertices in $R$ be in $G'$
as well as all edges between vertices in $V$ and $R$. Set the edge variables so that the edges
between $r_i$ and $s_i$, $1 \le i \le 2n$, are absent. The unassigned variables are between
vertices in $S$, between vertices in $R$ and $S$, and between vertices in $V$ and $S$, of which
there are $8n^2 - 3n$. Fix these unassigned edges so that the number of edges between vertices in
$V \cup R \cup S$ is $\lceil e(5n/2)/2 \rceil - k$, $1 \le k \le n$.
There are sufficiently many unassigned edges to do this.
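The count of $8n^2 - 3n$ unassigned edge variables can be confirmed by enumerating vertex pairs. The sketch below classifies each pair of $V \cup R \cup S$ according to the construction just described; the labels and function name are illustrative assumptions.

```python
from itertools import combinations

def classify_pairs(n):
    """Count vertex pairs of G' by how the construction treats them."""
    V = [("v", i) for i in range(n)]
    R = [("r", i) for i in range(2 * n)]
    S = [("s", i) for i in range(2 * n)]
    counts = {"copied from G": 0, "forced present": 0, "forced absent": 0, "unassigned": 0}
    for a, b in combinations(V + R + S, 2):
        kinds = {a[0], b[0]}
        if kinds == {"v"}:
            counts["copied from G"] += 1       # edge variable of G itself
        elif kinds <= {"v", "r"}:
            counts["forced present"] += 1      # R-R and V-R edges
        elif kinds == {"r", "s"} and a[1] == b[1]:
            counts["forced absent"] += 1       # the pairs (r_i, s_i)
        else:
            counts["unassigned"] += 1          # S-S, remaining R-S, and V-S pairs
    return counts
```

For example, `classify_pairs(2)["unassigned"]` counts the S-S, R-S and V-S pairs for a 2-vertex graph $G$, and the four categories together account for all $\binom{5n}{2}$ pairs.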

We now show that $G$ contains an $n/2$-clique or has more than $k$ edges if and only if
$G'$ contains a $5n/2$-clique or has more than $\lceil e(5n/2)/2 \rceil$ edges. If $G$ has an
$n/2$-clique, the edges between $V$ and $R$ combined with the edges between vertices in $R$ and
those in $G$ constitute a $5n/2$-clique since $5n/2$ vertices in $V \cup R$ are completely
connected. If $G$ has more than $k$ edges, since there are exactly $\lceil e(5n/2)/2 \rceil - k$
edges between vertices in $V \cup R \cup S$, $G'$ has at least $\lceil e(5n/2)/2 \rceil$ edges.
On the other hand, if $G'$ has a $(5n/2)$-clique, because there is at least one absent edge
between each pair of vertices $(r_i, s_i)$, $1 \le i \le 2n$, the largest clique on vertices in
$R \cup S$ has size $2n$. Thus, there must be an $(n/2)$-clique on vertices in $V$; that is, $G$
contains an $(n/2)$-clique. Similarly, since the number of edges between vertices in $V$ and those
in $R \cup S$ is exactly $\lceil e(5n/2)/2 \rceil - k$, if $G'$ contains at least
$\lceil e(5n/2)/2 \rceil$ edges, $G$ must contain at least $k$ edges.

The membership of graph $G$ in HALF-CLIQUE is determined by specializing the graph
$G'$ by mapping its edge variables to the constants 0 and 1 or to variables of $G$. Thus,
the function testing $G$'s membership is obtained through a subfunction reduction of the
function testing $G'$'s membership. (See Definition 2.4.2.) Thus, at no increase in circuit
size, for any $k$ a circuit for $\left(f^{(n)}_{\mathrm{clique},\lceil n/2 \rceil}\right)^{[k]}$
can be obtained from a circuit for $f^{(5n)}_{\mathrm{clique\ slice}}$. Thus, the circuit size
for the latter is at least as large as that for the former, which gives the second result of the
theorem.

The statement that for $k < e(n/2)$,
$\left(f^{(n)}_{\mathrm{clique},\lceil n/2 \rceil}\right)^{[k]} = \tau^{(e(n))}_{k+1}$ follows
from the observation that for these values of $k$ the value of the clique function on inputs of
weight $e(n/2) - 1$ or less is 0. ■

As this theorem indicates, the search for a proof that P $\neq$ NP can be limited to the study
of the monotone circuit size of the central slice of certain monotone functions. Other central
slices of NP-complete problems have been shown to be NP-complete also. (See the Chapter
Notes.)



9.7 Circuit Depth 



Circuit depth and formula size are exponentially related, as shown in Section 9.2.3. In this 
section we examine the depth of circuits whose operations have either bounded or unbounded 
fan-in. As seen in Chapter 3, circuits of bounded fan-in are useful in classifying problems by 
their complexity and in developing relationships between time and space and circuit size and 
depth. 

Circuits of unbounded fan-in are constructed of AND and OR gates with potentially un- 
bounded fan-in whose inputs are the outputs of other such gates or literals, namely, variables 
and their negations. Every Boolean function can be realized by a circuit of unbounded fan-in 
and bounded depth, as is seen by considering the DNF of a Boolean function: it corresponds to 
a depth-2, unbounded fan-in circuit. Knowledge of the complexity of bounded-depth circuits 
may shed light on the complexity of bounded-fan-in circuits. 




In this section we first show that the depth of a function $f$ is equal to the communication
complexity of a related problem in a two-player game. Communication complexity is a measure
of the amount of information that must be exchanged between two players to perform a
computation. We establish such a connection for all Boolean functions over the standard basis
$\Omega_0$ and monotone functions over the monotone basis $\Omega_{\mathrm{mon}}$. These
connections are used to derive lower bounds on circuit depth for monotone and non-monotone
functions. After establishing these results we examine bounded-depth circuits and demonstrate
that some problems require exponential size when realized by such circuits.

9.7.1 Communication Complexity 

We define a communication game between two players who have unlimited computing power 
and communicate via an error-free channel. This game has sufficient generality to derive 
interesting lower bounds on circuit depth. 

DEFINITION 9.7.1 A communication game $(U, V)$ is defined by sets $U, V \subseteq \mathcal{B}^n$,
where $U \cap V = \emptyset$. An instance of the game is defined by $u \in U$ and $v \in V$;
$u$ is assigned to Player I and $v$ is assigned to Player II. Players alternate sending binary
messages to each other. We assume that the binary messages form a prefix code (no message is a
prefix for another) so that one player can determine when the other has finished transmitting a
message.

Although each player has unlimited computing power, each message it sends is a function of just
its own $n$-tuple and the messages it has received previously from the other player. The two
functions used by the players to determine the contents of their messages constitute the protocol
$\Pi$ under which the communication game is played. The protocol also determines the first player
to send a message and termination of the game. The goal of the game is to find an index $i$,
$1 \le i \le n$, such that $u_i \ne v_i$.

Let $\Pi(u, v)$ denote the number of bits exchanged under $\Pi$ on the instance $(u, v)$ of the
game $(U, V)$. The communication complexity $C(U, V)$ of the communication game $(U, V)$ is the
minimum over all protocols $\Pi$ of the maximum number of bits exchanged under $\Pi$ on any
instance of $(U, V)$; that is,

$$C(U, V) = \min_{\Pi} \max_{u \in U,\, v \in V} \Pi(u, v)$$

Note that there is always a position $i$, $1 \le i \le n$, such that $u_i \ne v_i$ since
$U \cap V = \emptyset$.

The communication game models a search problem; given disjoint sets of $n$-tuples, $U$ and
$V$, the two players search for an input variable on which the two $n$-tuples differ. A related
communication game measures the exchange of information to obtain the value of a function
$f : X \times Y \mapsto Z$ on two variables in which one player has a value in $X$ and the other
has a value in $Y$. The players must acquire enough information about each other's variable to
compute the function.

Every communication problem $(U, V)$, where $U, V \subseteq \mathcal{B}^n$, can be solved with
communication complexity $C(U, V) \le n + \lceil \log_2 n \rceil$ by the following protocol:

• Player I sends $u$ to Player II.

• Player II determines a position in which $u \ne v$ and sends it to Player I using
$\lceil \log_2 n \rceil$ bits.

This bound can be improved to $C(U, V) \le n + \log_2^* n$, where $\log_2^* n$ is the number of
times that $\lceil \log_2 \rceil$ must be taken to reduce $n$ to zero. (See Problem 9.39.) The
log-star function $\log_2^* n$ grows very slowly. For example, $\log_2^* 10^{10^{10}}$ is 8; by
contrast, $\log_2 10^{10^{10}} \approx 33{,}219{,}280{,}949$.

These concepts are illustrated by the parity communication problem $(U, V)$, defined
below, where $n = 2^k$:

$U = \{u \mid u$ has an even number of 1s$\}$
$V = \{v \mid v$ has an odd number of 1s$\}$

The following protocol achieves a communication complexity bound of $C(U, V) \le 2 \log_2 n$
for this problem. Later we show it is best possible.

1. If $n = 1$, the players know where their tuples differ and no communication is necessary.

2. If $n > 1$, go to the next step.

3. Player I sends the parity of the first $n/2$ bits of $u$ to Player II.

4. Since the parities of $u$ and $v$ differ, the parities of their halves differ on exactly one
half of the variables. Player II compares the received bit with the parity of the first $n/2$
bits of $v$ and, with one bit, tells Player I which half this is. Play is resumed at the first
step with the half of the variables on which they are known to differ.

Let $\kappa(n)$ denote the number of bits exchanged with this protocol. Then $\kappa(1) = 0$ and
$\kappa(n) \le \kappa(n/2) + 2$, whose solution is $\kappa(n) = 2 \log_2 n$. Thus,
$C(U, V) \le \kappa(n) \le 2 \log_2 n$.
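The halving protocol can be simulated directly. The following sketch assumes $n$ is a power of two and that $u$ has even parity while $v$ has odd parity; it plays both roles in one loop and counts the two bits exchanged per round. The function names are illustrative.

```python
from itertools import product

def parity_protocol(u, v):
    """Return (index where u and v differ, number of bits exchanged)."""
    lo, hi = 0, len(u)          # invariant: parities of u and v differ on [lo, hi)
    bits = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        pu = sum(u[lo:mid]) % 2  # Player I sends the parity of its first half
        bits += 1
        pv = sum(v[lo:mid]) % 2  # Player II compares with its own first half
        bits += 1                # ...and replies with one bit naming the half
        if pu != pv:
            hi = mid
        else:
            lo = mid
    return lo, bits

def worst_case_bits(n):
    """Maximum bits over all instances; also checks every returned index."""
    tuples = list(product((0, 1), repeat=n))
    evens = [x for x in tuples if sum(x) % 2 == 0]
    odds = [x for x in tuples if sum(x) % 2 == 1]
    worst = 0
    for u in evens:
        for v in odds:
            i, bits = parity_protocol(u, v)
            assert u[i] != v[i]
            worst = max(worst, bits)
    return worst
```

Every instance uses the same number of rounds, one per halving, so the worst case equals $2 \log_2 n$ bits.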

9.7.2 General Depth and Communication Complexity 

We now establish a relationship between the depth $D_{\Omega_0}(f)$ of a Boolean function
$f : \mathcal{B}^n \mapsto \mathcal{B}$ over the standard basis $\Omega_0$ and the communication
complexity of a communication game in which $U = f^{-1}(0)$ and $V = f^{-1}(1)$, where
$f^{-1}(a)$ is the set of $n$-tuples for which $f$ has value $a$. Theorem 9.7.1 asserts that
$D_{\Omega_0}(f)$ and $C(f^{-1}(0), f^{-1}(1))$ have exactly the same value. Later we establish a
similar result for monotone functions realized over the monotone basis. We divide this result
into two lemmas that are proved separately.

THEOREM 9.7.1 For every Boolean function $f : \mathcal{B}^n \mapsto \mathcal{B}$,

$$D_{\Omega_0}(f) = C(f^{-1}(0), f^{-1}(1))$$

The communication game allows the two players to have unlimited computing power at
their disposal. Thus, the protocol they employ can be an arbitrarily complex function. This
power reflects the non-uniformity in the circuit model.

LEMMA 9.7.1 For all Boolean functions $f : \mathcal{B}^n \mapsto \mathcal{B}$ and all
$U, V \subseteq \mathcal{B}^n$ such that $U \subseteq f^{-1}(0)$ and $V \subseteq f^{-1}(1)$, the
following bound holds:

$$C(U, V) \le D_{\Omega_0}(f)$$

Proof In this lemma we demonstrate that a protocol for the communication game
$(f^{-1}(0), f^{-1}(1))$ can be constructed from a circuit of minimal depth for the Boolean
function $f$. We assume that such a circuit has negations only on input variables. By
Problem 9.11 there is such a circuit.

Given an instance defined by $u \in f^{-1}(0)$ and $v \in f^{-1}(1)$, the players follow a path
from the circuit output to an input at which $u$ and $v$ differ. The invariant that applies at
each step is that Player I (which holds $u$) simulates an AND gate whose value on $u$ is 0
whereas Player II (which holds $v$) simulates an OR gate whose value on $v$ is 1. The bits
transmitted by one player to the other specify which input to the current gate to follow on
the way from the output vertex to an input vertex of the circuit for $f$.

The proof is by induction. The base case applies to those Boolean functions $f$ for which
$D_{\Omega_0}(f) = 0$. In this case $f$ is either $x_i$ or $\bar{x}_i$ for some $i$ where $x_i$
is an input variable of $f$. Thus, for each instance of the problem, both players know in advance
a variable (namely, $x_i$) on which $u$ and $v$ differ. Hence, $C(U, V) = 0$ and the base case is
established.

For the induction step, either $f = f_1 \wedge f_2$ or $f = f_1 \vee f_2$. Consider the first
case; the second is treated in a similar fashion. Obviously
$D_{\Omega_0}(f) = \max(D_{\Omega_0}(f_1), D_{\Omega_0}(f_2)) + 1$. (We are considering circuits
of minimal depth.) Let $U_j = U \cap f_j^{-1}(0)$ for $j = 1, 2$. Since $(U_j, V)$ is a
communication game associated with $f_j$ ($f_j$ must have value 1 on $V$) and
$D_{\Omega_0}(f_j) < D_{\Omega_0}(f)$, by induction $C(U_j, V) \le D_{\Omega_0}(f_j)$.

Since the output gate is AND (the other case is treated similarly), both $f_1$ and $f_2$ have
value 1 on $V$, but at least one of them has value 0 on $U$. We use the following protocol for
$(U, V)$: Player I sends 0 if $u \in U_1$ (associated with the input $f_1$ to this AND gate) and
1 if $u \in U_2$ (associated with the input $f_2$). (If the output gate is OR, we observe that at
least one of $f_1$ and $f_2$ has value 1 on $V$ and define $V_1 = V \cap f_1^{-1}(1)$ and
$V_2 = V \cap f_2^{-1}(1)$. Player II sends a bit to specify the set containing $v$.) After the
first move the players follow the protocol for the $f_j$ defined by the bit sent by Player I.
Thus, when the output gate is AND the following bound holds:

$$C(U, V) \le 1 + \max_{j=1,2} C(U_j, V) \le 1 + \max(D_{\Omega_0}(f_1), D_{\Omega_0}(f_2)) = D_{\Omega_0}(f)$$

The same bound holds when the output gate is OR. ■
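The protocol built in this proof amounts to a walk from the output gate to an input. The sketch below illustrates it for formulas whose negations appear only at the inputs; the tuple encoding, the `EXAMPLE` formula, and the function names are assumptions made for illustration.

```python
from itertools import product

# Illustrative encoding: ("var", i), ("not", i), ("and", f1, f2), ("or", f1, f2)

def evaluate(f, x):
    op = f[0]
    if op == "var":
        return x[f[1]]
    if op == "not":
        return 1 - x[f[1]]
    a, b = evaluate(f[1], x), evaluate(f[2], x)
    return a & b if op == "and" else a | b

def depth(f):
    return 0 if f[0] in ("var", "not") else 1 + max(depth(f[1]), depth(f[2]))

def kw_protocol(f, u, v):
    """Walk from the output to an input; requires f(u) = 0 and f(v) = 1.

    Returns (index i with u[i] != v[i], number of bits sent).
    """
    bits = 0
    while f[0] in ("and", "or"):
        if f[0] == "and":   # Player I names a child that has value 0 on u
            f = f[1] if evaluate(f[1], u) == 0 else f[2]
        else:               # Player II names a child that has value 1 on v
            f = f[1] if evaluate(f[1], v) == 1 else f[2]
        bits += 1
    return f[1], bits

# (x0 XOR x1) AND (x2 OR x3), written with negations at the inputs only
EXAMPLE = ("and",
           ("or", ("and", ("var", 0), ("not", 1)), ("and", ("not", 0), ("var", 1))),
           ("or", ("var", 2), ("var", 3)))
```

At every step the current subformula has value 0 on `u` and value 1 on `v`, so the literal reached at the end names a coordinate where the two tuples differ, and the number of bits never exceeds the depth of the formula.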

We now prove the second half of Theorem 9.7.1. 

LEMMA 9.7.2 Let $U, V \subseteq \mathcal{B}^n$ be such that $U \cap V = \emptyset$. Then there
exists a Boolean function $f : \mathcal{B}^n \mapsto \mathcal{B}$ with $U \subseteq f^{-1}(0)$ and
$V \subseteq f^{-1}(1)$ such that the following bound holds:

$$D_{\Omega_0}(f) \le C(U, V)$$

Proof In this proof we show how to define a Boolean function and a circuit for it from a
protocol for $(U, V)$. From the protocol a tree is constructed. The root is associated with the
player who sends the first bit. As in the proof of Lemma 9.7.1, Player I is associated with
AND gates and Player II with OR gates. Thus, if the protocol specifies that Player I makes
the first move, the root is labeled AND. The two possible descendants are labeled with the
player who makes the next transmission or by a variable or its negation (the answer) if this
is the last transmission under the protocol. The function associated with the protocol is the
function computed by the circuit so constructed.

We establish the result by induction. The base case applies to sets $U$ and $V$ for which
$C(U, V) = 0$. In this case, there is an index $i$ known in advance to both players on which
$u \in U$ and $v \in V$ differ. Since either $u_i = 1$ or $u_i = 0$ for all $u \in U$ ($v_i$ has
the complementary value for all $v \in V$), let $f = \bar{x}_i$ in the first case and $f = x_i$
in the second. Thus, in the first case (the second case is treated similarly)
$U \subseteq f^{-1}(0)$, $V \subseteq f^{-1}(1)$ and $D_{\Omega_0}(f) = 0$. This establishes the
base case.

For the induction step, without loss of generality, let Player I send the first bit. (The
other case is treated similarly.) For some partition of $U = U_1 \cup U_2$,
$U_1 \cap U_2 = \emptyset$, Player I sends a 0 if $u \in U_1$ and a 1 if $u \in U_2$, after which
the players play with the best protocol for each subcase. It follows that

$$C(U, V) = 1 + \max_{j=1,2} C(U_j, V)$$

Since $C(U_j, V) < C(U, V)$ for $j = 1, 2$, by induction there exist Boolean functions $f_1$
and $f_2$ such that $U_j \subseteq f_j^{-1}(0)$ and $V \subseteq f_j^{-1}(1)$ and
$D_{\Omega_0}(f_j) \le C(U_j, V)$ for $j = 1, 2$. Since the output vertex is assumed to be AND,
$f = f_1 \wedge f_2$. $f$ has value 1 only when both $f_1$ and $f_2$ have value 1 and has value 0
when either $f_1$ or $f_2$ has value 0. Thus, we have

$$V \subseteq f_1^{-1}(1) \cap f_2^{-1}(1) = f^{-1}(1)$$
$$U = U_1 \cup U_2 \subseteq f_1^{-1}(0) \cup f_2^{-1}(0) = f^{-1}(0)$$

from which we conclude that

$$D_{\Omega_0}(f) \le 1 + \max(D_{\Omega_0}(f_1), D_{\Omega_0}(f_2)) \le 1 + \max_{j=1,2} C(U_j, V) = C(U, V)$$

which is the desired result. ■

This establishes the connection between the depth of a Boolean function $f$ over the
standard basis $\Omega_0$ and the communication complexity associated with the sets $f^{-1}(0)$
and $f^{-1}(1)$.

We now draw some conclusions from Theorem 9.7.1. From the observation made above
that $C(U, V) \le n + \log_2^* n$ for an arbitrary communication problem $(U, V)$ when
$U, V \subseteq \mathcal{B}^n$, we have that $D_{\Omega_0}(f) \le n + \log_2^* n$ for all
$f : \mathcal{B}^n \mapsto \mathcal{B}$. A better upper bound of $D_{\Omega_0}(f) \le n + 1$ is
given in Theorem 2.13.1. The best upper bound of $n - \log_2 \log_2 n + O(1)$ has been derived by
Gaskov [110], matching the lower bound of $n - \Theta(\log \log n)$ derived in Theorem 2.12.2.

The parity communication problem described above is defined in terms of the two sets
that are the inverse images of the parity function
$f^{(n)}_{\oplus} : \mathcal{B}^n \mapsto \mathcal{B}$. As stated in Problem 9.28, this function
has a formula size of at least $n^2$. Since $D_{\Omega_0}(f) \ge \log_2 L_{\Omega_0}(f)$
(Theorem 9.2.2), it follows that $D_{\Omega_0}(f^{(n)}_{\oplus}) \ge 2 \log_2 n$, which matches
the upper bound on the communication complexity of the parity communication problem. Thus the
protocol given earlier for this problem is optimal.

We now introduce the monotone communication game and develop a relationship be- 
tween its complexity and the depth of monotone functions over a monotone basis. 

9.7.3 Monotone Depth and Communication Complexity 

We specialize Theorem 9.7.1 to monotone functions by using the fact that if
$f : \mathcal{B}^n \mapsto \mathcal{B}$ is monotone and there are two $n$-tuples $u$ and $v$ such
that $f(u) = 0$ and $f(v) = 1$, then there exists an index $i$, $1 \le i \le n$, such that
$u_i < v_i$, that is, $u_i = 0$ and $v_i = 1$.

The binary $n$-tuple $x$ can be defined by the set $\{i \mid x_i = 1\}$ of indices on which
variables have value 1. This is a subset of $[n] = \{1, 2, \ldots, n\}$. Let $2^{[n]}$ be the
power set of $[n]$, that is, the set of all subsets of $[n]$. A monotone minterm (monotone
maxterm) is a minimal set of indices of variables that if set to 1 (0) cause $f$ to assume value
1 (0). (The variables of a monotone minterm are variables in a monotone prime implicant of $f$.)
Let $\min(f)$ and $\max(f)$ be the set of monotone minterms and monotone maxterms of $f$,
respectively. Observe that $a \cap b \ne \emptyset$ for every $a \in \min(f)$ and
$b \in \max(f)$ because if $a$ and $b$ have no elements in common, $f$ can be made to assume
values 0 and 1 simultaneously for some assignment to the variables of $f$, a contradiction.
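For small monotone functions the minterms and maxterms can be found by exhaustive search, which makes the intersection property concrete. The sketch below uses the 3-input majority function as an example and indexes variables from 0 rather than 1; the helper names are illustrative.

```python
from itertools import combinations

def minimal_sets(n, good):
    """Inclusion-minimal subsets S of range(n) with good(S) true (good is upward closed)."""
    found = []
    for size in range(n + 1):
        for s in combinations(range(n), size):
            s = frozenset(s)
            # Minimal sets are found in order of size, so any smaller good
            # subset of s would contain a set already in `found`.
            if good(s) and not any(t <= s for t in found):
                found.append(s)
    return found

def monotone_terms(f, n):
    """Return (monotone minterms, monotone maxterms) of a monotone function f."""
    ones = lambda s: tuple(int(i in s) for i in range(n))       # s set to 1, rest 0
    zeros = lambda s: tuple(int(i not in s) for i in range(n))  # s set to 0, rest 1
    minterms = minimal_sets(n, lambda s: f(ones(s)) == 1)
    maxterms = minimal_sets(n, lambda s: f(zeros(s)) == 0)
    return minterms, maxterms

def majority3(x):
    return int(sum(x) >= 2)
```

For majority on three inputs both families consist of the three 2-element subsets, and any two 2-element subsets of a 3-element set share an index.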

DEFINITION 9.7.2 A monotone communication game $(A, B)$ is defined by sets
$A, B \subseteq 2^{[n]}$. An instance of the game is a pair $(a, b)$ where $a \in A$ and
$b \in B$; $a$ is assigned to Player I and $b$ is assigned to Player II. Players alternate
sending messages as in the communication game, using a predetermined protocol. The goal of the
problem is to find an integer $i \in a \cap b$. The communication complexity,
$C_{\mathrm{mon}}(A, B)$, is defined as the minimum over all protocols $\Pi$ of the maximum
number of bits exchanged under $\Pi$ on any instance of $(A, B)$:

$$C_{\mathrm{mon}}(A, B) = \min_{\Pi} \max_{a \in A,\, b \in B} \Pi(a, b)$$

We now establish a relationship between this complexity measure and the circuit depth of
a Boolean function.

THEOREM 9.7.2 For every monotone Boolean function $f : \mathcal{B}^n \mapsto \mathcal{B}$,

$$D_{\Omega_{\mathrm{mon}}}(f) = C(f^{-1}(0), f^{-1}(1)) = C_{\mathrm{mon}}(\min(f), \max(f))$$

Proof We show that $D_{\Omega_{\mathrm{mon}}}(f) = C(f^{-1}(0), f^{-1}(1))$ by specializing
Lemmas 9.7.1 and 9.7.2 to monotone functions. In the base case of Lemma 9.7.1, since the circuit
is monotone we always discover a coordinate $i$ such that $u_i = 0$ and $v_i = 1$ and negations
are not needed. Thus, $C(f^{-1}(0), f^{-1}(1)) \le D_{\Omega_{\mathrm{mon}}}(f)$. In Lemma 9.7.2,
since the protocol provides a coordinate $i$ such that $u_i = 0$ and $v_i = 1$, the circuit
defined by it is monotone and $D_{\Omega_{\mathrm{mon}}}(f) \le C(f^{-1}(0), f^{-1}(1))$.

We show that $C(f^{-1}(0), f^{-1}(1)) = C_{\mathrm{mon}}(\min(f), \max(f))$ in two stages. First
we show that $C_{\mathrm{mon}}(\min(f), \max(f)) \le C(f^{-1}(0), f^{-1}(1))$. This follows
because, given any $a \in \min(f)$ and $b \in \max(f)$, we extend $a$ and $b$ to binary
$n$-tuples $u$ and $v$ for which $u_r = 0$ for $r \in a$ and $v_s = 1$ for $s \in b$ and use the
protocol for the standard communication game to find an index $i$ such that $u_i = 0$ and
$v_i = 1$, that is, for which $i \in a \cap b$. Thus, the monotone communication game exchanges
no more bits than the standard game.

To show that $C(f^{-1}(0), f^{-1}(1)) \le C_{\mathrm{mon}}(\min(f), \max(f))$, consider an
instance $(u, v)$ of $(U, V)$ where $U = f^{-1}(0)$ and $V = f^{-1}(1)$. To solve the
communication problem $(U, V)$, let $a(u) \subseteq [n]$ be defined by $r \in a(u)$ if and only
if $u_r = 0$ and let $b(v) \subseteq [n]$ be defined by $s \in b(v)$ if and only if $v_s = 1$.
The goal of the standard communication game is to find an index $i$ such that $u_i \ne v_i$. It
follows from the definition of minterms and maxterms that there exist $p \in \min(f)$ and
$q \in \max(f)$ such that $p \subseteq a(u)$ and $q \subseteq b(v)$. Since each player has
unlimited computing resources available, computation of $p$ and $q$ can be done with no
communication cost. Now invoke the protocol on the instance $(p, q)$ of the monotone
communication game $(\min(f), \max(f))$. This protocol returns an index $i \in p \cap q$ that is
also an index on which $u$ and $v$ differ. But this is a solution to the instance $(u, v)$ of
$(f^{-1}(0), f^{-1}(1))$. Thus, no more bits are communicated to solve the standard communication
game than are exchanged with the monotone communication game when the sets $U$ and $V$ are the
inverse images of a monotone Boolean function. ■

In the next section we use the above result to derive a large lower bound on the monotone 
depth of the clique function. 

9.7.4 The Monotone Depth of the Clique Function 

In this section we illustrate the use of the monotone communication game by showing that
in this game at least $\Omega(\sqrt{k})$ bits must be exchanged between two players to compute the
clique function $f^{(n)}_{\mathrm{clique},k} : \mathcal{B}^{n(n-1)/2} \mapsto \mathcal{B}$ defined
in Section 9.6 when $k \le (n/2)^{2/3}$. The inputs to $f^{(n)}_{\mathrm{clique},k}$ are variables
associated with the edges of a graph on $n$ vertices. If an edge variable $e_{i,j} = 1$, the edge
between vertices $i$ and $j$ is present. Otherwise, it is absent. By Theorem 9.7.2, a lower bound
of $\Omega(\sqrt{k})$ on the number of bits that must be exchanged between the two players to
compute $f^{(n)}_{\mathrm{clique},k}$ implies that $f^{(n)}_{\mathrm{clique},k}$ has monotone
depth $\Omega(\sqrt{k})$.

THE RULES OF THE GAME Fix $n$ and $k$. The players in this communication game are each
given sets of edges of graphs on $n$ vertices. Player I is given a set of edges that contains a
$k$-clique (an input on which $f^{(n)}_{\mathrm{clique},k}$ has value 1, a positive instance)
whereas Player II is given a set of edges that does not contain a $k$-clique (an input on which
it has value 0, a negative instance). The goal of the game is to exchange the minimum number of
bits for the worst-case instances to permit the players to identify an edge variable that is 1 on
a positive instance and 0 on a negative one. This number of bits is the communication complexity
of the game.

To derive the lower bound on communication complexity, we restrict the graphs under
consideration by choosing them so that every protocol must exchange a lot of data (this cannot
make the worst cases any worse). In particular, we give Player I only $k$-cliques, the set of
graphs, CLQ, whose only edges are those between an arbitrary set of $k$ vertices. We call Player
I the clique player. Also, we give Player II a $(k-1)$-coloring drawn from the set COL of all
possible assignments of $k - 1$ colors to the $n$ vertices of a graph $G$. The interpretation of a
$(k-1)$-coloring is that two vertices can have the same color only if there is no edge between
them. Thus, any graph that has a $(k-1)$-coloring cannot contain a $k$-clique because the $k$
vertices in such a subgraph must have different colors. We call Player II the color player. The
goal now becomes for the two players to find a monochromatic edge (both endpoints have the
same color) owned by the clique player.

In the standard communication game players alternate exchanging binary messages. We 
simplify our discussion by assuming that each player transmits one bit simultaneously on each 
round. We then find a lower bound on the number of rounds and use this as a lower bound 
on the number of bits exchanged between the two players. 

AN ADVERSARIAL STRATEGY We describe an adversarial strategy for the selection of cliques and
colorings that ensures that many rounds are needed for the two players to arrive at a decision.
To present the strategy, we need some notation.

Let $\mathrm{CLQ}_0$ denote the set of graphs $G = (V, E)$ on $n$ vertices that contain only
those edges in a $k$-clique. It follows that $\mathrm{CLQ}_0$ contains $\binom{n}{k}$ graphs. Let
$\mathrm{COL}_0$ denote the set of $(k-1)$-colorings of graphs on $n$ vertices, that is,
$\mathrm{COL}_0 = \{c \mid c : V \mapsto [k-1]\}$, where $[k-1]$ denotes the set
$\{1, 2, \ldots, k-1\}$. It follows that $\mathrm{COL}_0$ contains $(k-1)^n$ $(k-1)$-colorings.

We execute a series of rounds. During each round each player provides one bit of
information to the other. This information has the effect of reducing the uncertainty of the color
player about the possible $k$-cliques held by the clique player and of reducing the uncertainty of
the clique player about the possible $(k-1)$-colorings held by the color player. The adversary
makes the uncertainty large after each round so that the number of rounds needed will be large
and a structure of the sets of cliques and colorings that can be analyzed will be maintained.
The game ends when both players have found a monochromatic edge that is in a clique.
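That a monochromatic clique edge always exists follows from the pigeonhole principle: $k$ clique vertices receive only $k - 1$ colors. The sketch below checks this exhaustively for small parameters and also confirms the counts $\binom{n}{k}$ and $(k-1)^n$ of cliques and colorings. Vertices and colors are numbered from 0 here, and the function names are illustrative.

```python
from itertools import combinations, product
from math import comb

def monochromatic_clique_edge(clique, coloring):
    """Return an edge of the clique whose endpoints share a color, or None."""
    for a, b in combinations(clique, 2):
        if coloring[a] == coloring[b]:
            return (a, b)
    return None

def check_game(n, k):
    """True if every (k-clique, (k-1)-coloring) pair has a monochromatic clique edge."""
    cliques = list(combinations(range(n), k))          # vertex sets of the k-cliques
    colorings = list(product(range(k - 1), repeat=n))  # all (k-1)-colorings
    assert len(cliques) == comb(n, k)
    assert len(colorings) == (k - 1) ** n
    return all(monochromatic_clique_edge(q, c) is not None
               for q in cliques for c in colorings)
```

The search itself is free for the players only once they know each other's input; the lower bound argument concerns how many bits they must exchange before such an edge is determined.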

Let $P_t \subseteq V$ and $M_t \subseteq V$ denote the vertices that after the $t$th round are
present in every $k$-clique and missing from every $k$-clique, respectively. (Let
$p_t = |P_t|$ and $m_t = |M_t|$.) Since vertices in $M_t$ are not in any cliques after the $t$th
round, as we shall see, each such vertex can be assigned the same color as a "friend" after all
vertices not in $M_t$ have been colored. Also, after the $t$th round the vertices in a $k$-clique
consist of vertices in $V - M_t$ of which those in $P_t$ are the same for all such cliques.

Let $\mathrm{CLQ}(V, P_t, M_t)$ denote the set of $k$-cliques containing $P_t$ but no vertex in
$M_t$. Let $\mathrm{COL}(V, M_t)$ denote the $(k-1)$-colorings of vertices not in $M_t$ after the
$t$th round. Then $|\mathrm{CLQ}(V, P_t, M_t)| = \binom{n - m_t - p_t}{k - p_t}$ and
$|\mathrm{COL}(V, M_t)| = (k-1)^{n - m_t}$ are the maximum numbers of $k$-cliques and
$(k-1)$-colorings that are possible after the $t$th round. Let $\mathrm{CLQ}_t$ and
$\mathrm{COL}_t$ denote the actual sets of cliques and colorings that are consistent with the
information exchanged between players after the $t$th round.

Given two sets $A$ and $B$, $A \subseteq B$, we introduce a measure $\mu_B(A) = |A|/|B|$ used in
deriving our lower bound. For an element $x \in A$, $\mu_B(A)$ is a rough measure of the amount
of information that can be deduced about $x$. The smaller the value of $\mu_B(A)$, the more
information we have about $x$. This measure is specialized to cliques and colorings after the
$t$th round:

$$\mu_{\mathrm{CLQ}(V,P_t,M_t)}(\mathrm{CLQ}_t) = |\mathrm{CLQ}_t| / |\mathrm{CLQ}(V, P_t, M_t)|$$
$$\mu_{\mathrm{COL}(V,M_t)}(\mathrm{COL}_t) = |\mathrm{COL}_t| / |\mathrm{COL}(V, M_t)|$$

Since the color player does not know the identity of vertices in $P_t$ until after the $t$th
round, its information about the clique held by the other player is measured by $p_t$ and
$\mu_{\mathrm{CLQ}(V,P_t,M_t)}(\mathrm{CLQ}_t)$. Since the clique player only knows the color of
vertices $M_t$ that are missing in all cliques after the $t$th round, its information about a
$(k-1)$-coloring held by the color player is measured by $m_t$ and
$\mu_{\mathrm{COL}(V,M_t)}(\mathrm{COL}_t)$.

The number of rounds, $T$, is large if for $t = T$ there is no edge present in all remaining
cliques $\mathrm{CLQ}_t$ that is monochromatic in all remaining colorings $\mathrm{COL}_t$. We
show that an adversary can choose the sets $\mathrm{CLQ}_t$ and $\mathrm{COL}_t$ at each round so
that many rounds are needed.

SELECTION OF THE SETS CLQ T AND COL T BY THE ADVERSARY: Let the value of the bits sent by 
the clique and color players be &clq and &col> respectively. At the tth round the following 
algorithm is used to choose CLQ f and COL f : 

1) Let P = Pt-i, p = Pt, M = M t -\ and m = m t -\. Let CLQ 1 be the larger of the 
two subsets of CLQ t _i consistent with the values &clq = and &clq = 1- Thus, 
Mclq(v,p,a/)(CLQ ) > A i CLQ(y,p,A/)(CLQ t _ 1 )/2. 

2) Let $CLQ$ be a collection of $k$-cliques. Then the set of cliques $q$ in $CLQ$ containing the
vertex $v$ is denoted $CLQ(v) = \{q \in CLQ \mid v \in q\}$.

Let $CLQ = CLQ^1$. As long as there exists $v \in V - P - M$ such that the following is
true:

$$\mu_{CLQ(V,P,M)}(CLQ(v)) \ge \frac{2(k-p-m)}{(n-p-m)}\,\mu_{CLQ(V,P,M)}(CLQ) \qquad (9.2)$$

replace $P$ by $P^* = P \cup \{v\}$, $p$ by $p^* = p + 1$, and $CLQ$ by $CLQ^* = CLQ(v)$. Here
$(k-p-m)\,\mu_{CLQ(V,P,M)}(CLQ)/(n-p-m)$ is the average of $\mu_{CLQ(V,P,M)}(CLQ(v))$
over all $v \in V - P - M$. Thus, $CLQ(v)$ has measure at least twice the average.

Since $|CLQ(V,P^*,M)| = (k-p-m)\,|CLQ(V,P,M)|/(n-p-m)$ after each iteration
of this loop, the following bound holds:

$$\mu_{CLQ(V,P^*,M)}(CLQ^*) \ge 2\,\mu_{CLQ(V,P,M)}(CLQ)$$

That is, the renormalized measure of the set of cliques after one iteration of the loop is at
least double the measure before the iteration.

After exiting from this loop let $CLQ_t^* = CLQ$ and let $P_t = P$. Since $P_t$ contains
$p_t - p_{t-1}$ more items than $P_{t-1}$, the following inequality holds:

$$\mu_{CLQ(V,P_t,M_{t-1})}(CLQ_t^*) \ge 2^{p_t - p_{t-1}}\,\mu_{CLQ(V,P_{t-1},M_{t-1})}(CLQ^1)$$
$$\ge 2^{p_t - p_{t-1}}\,\mu_{CLQ(V,P_{t-1},M_{t-1})}(CLQ_{t-1})/2 \qquad (9.3)$$

Furthermore, for any vertex $v$ remaining in $V - P$ the condition expressed in (9.2) is
violated, so that the following holds for $v \in V - P$, where
$a = 2(k - p_t - m_{t-1})/(n - p_t - m_{t-1})$:

$$\mu_{CLQ(V,P_t,M_{t-1})}(\{q \in CLQ_t^* \mid v \in q\}) < a\,\mu_{CLQ(V,P_t,M_{t-1})}(CLQ_t^*) \qquad (9.4)$$

3) Let $COL_t^* = \{c \in COL_{t-1} \mid c \text{ is 1-1 on } P_t\}$. That is, $COL_t^*$ is the set of
$(k-1)$-colorings in $COL_{t-1}$ that assign unique colors to vertices in $P_t$. By restricting the
$(k-1)$-colorings we do not increase the number of rounds. In Lemma 9.7.3 we develop a lower
bound on $\mu_{COL(V,M_{t-1})}(COL_t^*)$ in terms of $\mu_{COL(V,M_{t-1})}(COL_{t-1})$.

4) Let $M = M_{t-1}$ and $m = m_{t-1}$. Let $COL^0$ and $COL^1$ denote the subsets of $COL_t^*$
consistent with the values $b_{COL} = 0$ and $b_{COL} = 1$, respectively. Let $COL$ be the larger
of these two sets. Then $\mu_{COL(V,M)}(COL) \ge \mu_{COL(V,M)}(COL_t^*)/2$.

5) The set $COL_t(u,v) = \{c \in COL \mid c(u) = c(v)\}$ contains those $(k-1)$-colorings in
$COL$ for which vertices $u$ and $v$ have the same color.

As long as there exist $u, v \in V - M$ such that the following is true:

$$\mu_{COL(V,M)}(COL_t(u,v)) \ge 2\,\mu_{COL(V,M)}(COL)/(k-1)$$

let $w$ be one of $u$ and $v$ that is not in $P$ (they cannot both be in $P$ and have the same color
because each coloring is 1-1 on $P$); replace $M$ by $M^* = M \cup \{w\}$, $m$ by $m^* = m + 1$,
and $COL$ by $COL^* = COL_t(u,v)$.

The term $\mu_{COL(V,M)}(COL)/(k-1)$ is the average of $\mu_{COL(V,M)}(COL_t(u,v))$ over all
$u$ and $v$ in $V - M$. Thus, $COL^*$ contains $(k-1)$-colorings whose measure is at least
twice the average.




Since $|COL(V,M^*)| = |COL(V,M)|/(k-1)$ after each iteration of this loop, the
following holds:

$$\mu_{COL(V,M^*)}(COL^*) \ge 2\,\mu_{COL(V,M)}(COL)$$

That is, the renormalized measure of the set of $(k-1)$-colorings after each loop iteration
is at least double the measure before the iteration.

After exiting from this loop, let $M_t = M$. Since $M_t$ contains $m_t - m_{t-1}$ more items than
$M_{t-1}$, the following inequality holds:

$$\mu_{COL(V,M_t)}(COL^*) \ge 2^{m_t - m_{t-1}}\,\mu_{COL(V,M_{t-1})}(COL)$$
$$\ge 2^{m_t - m_{t-1}}\,\mu_{COL(V,M_{t-1})}(COL_t^*)/2 \qquad (9.5)$$

6) Let $COL_t = COL^*$, $M_t = M$, and $CLQ_t = \{q \in CLQ_t^* \mid M_t \cap q = \emptyset\}$. Thus, $CLQ_t$
does not contain any cliques with vertices in $M_t$. In Lemma 9.7.4 we develop a lower
bound on $\mu_{CLQ(V,P_t,M_{t-1})}(CLQ_t)$ in terms of $\mu_{CLQ(V,P_t,M_{t-1})}(CLQ_t^*)$.

PERFORMANCE OF THE ADVERSARIAL STRATEGY We establish three lemmas and then derive 
the lower bound on the number of rounds of the communication game. 

LEMMA 9.7.3 After step 3 of the adversarial selection the following inequality holds:

$$\mu_{COL(V,M_{t-1})}(COL_t^*) \ge \left(1 - \frac{(p_t+1)^2}{k-1}\right)\mu_{COL(V,M_{t-1})}(COL_{t-1})$$

Proof Recall the definition of $COL_t(u,v) = \{c \in COL \mid c(u) = c(v)\}$. Consider the
results of step 3 of the $t$th round in the adversary selection process. Because of the choices
made in step 5 in the $(t-1)$st round and the choice of $COL_0$, the following inequality
holds for all $t > 0$ and $u, v \in V - M_{t-1}$ when $u \ne v$.

$$\mu_{COL(V,M_{t-1})}(COL_t(u,v)) < 2\,\mu_{COL(V,M_{t-1})}(COL_{t-1})/(k-1)$$

Because $M_t = M_{t-1}$ at step 3 of the $t$th round and $P_t \subseteq V - M_t$, the same bound applies
for $u$ and $v$ in $P_t$.

The set $COL_{t-1}$ is reduced to $COL_t^* = \{c \in COL_{t-1} \mid c \text{ is 1 to 1 on } P_t\}$ by discarding
$(k-1)$-colorings for which $u$ and $v$ are in $P_t$ and have the same color. From the above
facts the following inequalities hold (here instances of the measure $\mu$ carry the subscript
$COL(V,M_{t-1})$):

$$\mu(COL_t^*) = \mu(\{c \in COL_{t-1} \mid c \text{ is 1 to 1 on } P_t\})$$
$$= \mu(COL_{t-1}) - \mu\Big(\bigcup_{u,v \in P_t,\ u \ne v} COL_t(u,v)\Big)$$
$$\ge \mu(COL_{t-1}) - \sum_{u,v \in P_t,\ u \ne v} \mu(COL_t(u,v))$$
$$\ge \left(1 - \binom{p_t}{2}\frac{2}{k-1}\right)\mu(COL_{t-1})$$
$$\ge \left(1 - \frac{(p_t+1)^2}{k-1}\right)\mu(COL_{t-1})$$

From this the conclusion follows. ■

LEMMA 9.7.4 After step 6 of the adversarial selection the following inequality holds:

$$\mu_{CLQ(V,P_t,M_{t-1})}(CLQ_t) \ge \left(1 - \frac{2km_t}{n}\right)\mu_{CLQ(V,P_t,M_{t-1})}(CLQ_t^*)$$

Proof As stated in (9.4), after step 2 of the $t$th round of the adversary selection process we
have for all $v \in V - P_t - M_{t-1}$ the following inequality:

$$\mu_{CLQ(V,P_t,M_{t-1})}(\{q \in CLQ_t^* \mid v \in q\}) < \frac{2(k - p_t - m_{t-1})}{(n - p_t - m_{t-1})}\,\mu_{CLQ(V,P_t,M_{t-1})}(CLQ_t^*)$$

Since $M_t \subseteq V - P_t$, this bound applies to $v \in M_t$. In the rest of this proof all instances of
$\mu$ carry the subscript $CLQ(V,P_t,M_{t-1})$.

Since $CLQ_t = \{q \in CLQ_t^* \mid M_t \cap q = \emptyset\}$, after step 6 the following inequalities hold:

$$\mu(CLQ_t) = \mu(\{q \in CLQ_t^* \mid M_t \cap q = \emptyset\})$$
$$= \mu(CLQ_t^*) - \mu\Big(\bigcup_{v \in M_t} \{q \in CLQ_t^* \mid v \in q\}\Big)$$
$$\ge \left(1 - \frac{2(k - p_t - m_{t-1})\,m_t}{(n - p_t - m_{t-1})}\right)\mu(CLQ_t^*)$$
$$\ge \left(1 - \frac{2km_t}{n}\right)\mu(CLQ_t^*)$$

From this the conclusion follows. ■

The third lemma sets the stage for the principal result of this section. 

LEMMA 9.7.5 Let $k \ge 2$ and $t \le \sqrt{k}/4$ and $t \le n/(8k)$. Then the following inequalities hold:

$$\mu_{CLQ(V,P_t,M_{t-1})}(CLQ_t) \ge 2^{p_t - 2t}$$
$$\mu_{COL(V,M_t)}(COL_t) \ge 2^{m_t - 2t}$$

Proof The inequalities hold for $t = 0$ because $\mu_{CLQ(V,P_0,M_0)}(CLQ_0) = \mu_{COL(V,M_0)}(COL_0) = 1$.
We assume as inductive hypothesis that the inequalities hold for the first $t - 1$ rounds
and show they hold for the $t$th round as well.

Using the inductive hypothesis and (9.3), we have

$$\mu_{CLQ(V,P_t,M_{t-1})}(CLQ_t^*) \ge 2^{p_t - p_{t-1}}\,\mu_{CLQ(V,P_{t-1},M_{t-1})}(CLQ_{t-1})/2 \ge 2^{p_t - 2t + 1} \qquad (9.6)$$

Since $\mu_{CLQ(V,P_t,M_{t-1})}(CLQ_t^*) \le 1$, we conclude that $p_t \le 2t - 1$. Using this result, the
assumption that $t \le \sqrt{k}/4$, Lemma 9.7.3, and the inductive hypothesis, we have

$$\mu_{COL(V,M_{t-1})}(COL_t^*) \ge \left(1 - \frac{4t^2}{k-1}\right)\mu_{COL(V,M_{t-1})}(COL_{t-1})$$
$$\ge \left(1 - \frac{k}{4(k-1)}\right)\mu_{COL(V,M_{t-1})}(COL_{t-1})$$
$$\ge \frac{1}{2}\,\mu_{COL(V,M_{t-1})}(COL_{t-1})$$
$$\ge 2^{m_{t-1} - 2t + 1}$$

Combining this and (9.5) (note that in step 6 we let $COL_t = COL^*$), we have the first
of the two desired conclusions, namely $\mu_{COL(V,M_t)}(COL_t) \ge 2^{m_t - 2t}$. This implies that
$m_t \le 2t$. Applying this to the inequality in Lemma 9.7.4 and using the condition
$t \le n/(8k)$, we get the following inequality:

$$\mu_{CLQ(V,P_t,M_{t-1})}(CLQ_t) \ge \mu_{CLQ(V,P_t,M_{t-1})}(CLQ_t^*)/2$$

Combining this with the lower bound given in (9.6), we have the second of the two desired
conclusions, namely, $\mu_{CLQ(V,P_t,M_{t-1})}(CLQ_t) \ge 2^{p_t - 2t}$. ■

We now state the principal conclusion of this section. 

THEOREM 9.7.3 Let $2 \le k \le (n/2)^{2/3}$. Then the monotone communication complexity of the
$k$-clique function $f_{clique,k}^{(n)}$ is $\Omega(\sqrt{k})$.

Proof Run the adversarial selection process for $T = \sqrt{k}/4$ steps to produce sets $CLQ_T$,
$COL_T$, $P_T$, and $M_T$. Below we show that $CLQ_T$ and $COL_T$ are not empty. Give the
clique player a $k$-clique $q \in CLQ_T$ and the color player a $(k-1)$-coloring $c \in COL_T$. To
show that the two players cannot agree in $T$ or fewer rounds on an edge in a clique in $CLQ_T$
that is monochromatic in all $c \in COL_T$, assume they can, and let $(u, v) \in q$ be that edge.
It follows that both $u$ and $v$ are in $M_T$. But this cannot happen because, by construction,
$q \cap M_T = \emptyset$.

To show that $CLQ_T$ and $COL_T$ are not empty, observe that $k \le (n/2)^{2/3}$ and
$t \le \sqrt{k}/4$ imply that $t \le n/(8k)$. Thus, Lemma 9.7.5 can be invoked, which implies that
$p_t, m_t \le 2t \le \sqrt{k}/2 < k/2 < n$. Invoking the definitions, the following inequalities also
hold.

$$|CLQ_t| \ge 2^{p_t - 2t}\,|CLQ(V,P_t,M_{t-1})| > 0$$
$$|COL_t| \ge 2^{m_t - 2t}\,|COL(V,M_t)| > 0$$

Since the right-hand sides are non-zero, we have the desired conclusion. ■

9.7.5 Bounded-Depth Circuits 

As explained earlier, bounded-depth circuits are studied to help us understand the depth of 
bounded fan-in circuits. Bounded-depth circuits for arbitrary Boolean functions require that 
the fan-in of some gates be unbounded because otherwise only a bounded number of inputs 
can influence the output(s). 

In Section 2.3 we encountered the DNF, CNF, SOPE, POSE, and RSE normal forms. 
Each of these corresponds to a circuit of bounded depth. The DNF and SOPE normal forms 
represent Boolean functions as the OR of the AND of literals. The OR and each of the ANDs 
is a function of a potentially unbounded number of literals. The same statement applies to 
the CNF and POSE normal forms when AND and OR are exchanged. The RSE normal form 
represents Boolean functions as the EXCLUSIVE OR of the AND of variables, that is, without 
the use of negation. Again, the fan-in of the two types of operation is potentially unbounded. 

As stated in Problems 2.8 and 2.9, the SOPE and POSE of the parity function $f_\oplus^{(n)}$ have
exponential size, as does the RSE of the OR function $f_\vee^{(n)}$. In Problem 2.10 it is stated that
the function $f_{\mathrm{mod}\,3}^{(n)}$ has exponential size in the DNF, CNF, and RSE normal forms.

In this section we show that every bounded-depth circuit for the parity function $f_\oplus^{(n)}$ over
the basis containing the NOT gate on one input and the AND and OR gates on an arbitrary
number of inputs has exponential size. Thus, the depth-2 result extends to arbitrary depth.

BOUNDED-DEPTH PARITY CIRCUITS HAVE EXPONENTIAL SIZE We use an approximation method
to derive a lower bound on the size of a bounded-depth circuit for $f_\oplus^{(n)}$. This method parallels
almost exactly the method of Section 9.6.3. Starting with gates most distant from the output
and progressing toward it, replace each gate of a given circuit by an approximating circuit.
We show that as each replacement is made, the number of new errors it introduces is small.
However, we also show that after all gates are approximated, the number of errors between the
approximating circuit and $f_\oplus^{(n)}$ is large. This implies that the number of gates replaced is large.

The approximation method used here replaces each gate in a circuit by a polynomial over
GF(3), the three-element field containing $\{-1, 0, 1\}$, with the property that if the variables
of such a polynomial assume values in $B = \{0, 1\}$, the value of the polynomial is in $B$. For
example, the polynomial $x_1(1 - x_2)x_3$ has value 1 over $B$ only when $x_1 = x_3 = 1$ and
$x_2 = 0$ and has value 0 otherwise. Thus, it corresponds exactly to the minterm $x_1\bar{x}_2x_3$. Since
every minterm can be represented as a polynomial of this kind, every Boolean function $f$ can
be realized by a polynomial over GF(3) by forming the sum of one such polynomial for each
of its minterms. A $b$-approximator is a polynomial of degree $b$ that approximates a Boolean
function.
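
The minterm construction above can be checked mechanically. The following is a minimal sketch, not taken from the text: the helper names `minterm_poly` and `boolean_to_gf3` are hypothetical, and arithmetic modulo 3 stands in for GF(3) with residues 0, 1, 2 (so 2 plays the role of $-1$).

```python
from itertools import product

def minterm_poly(pattern):
    """Return a function evaluating, modulo 3, the product polynomial for one
    minterm: a factor x_i where pattern[i] == 1 and (1 - x_i) where it is 0."""
    def evaluate(x):
        value = 1
        for bit, xi in zip(pattern, x):
            value = (value * (xi if bit else 1 - xi)) % 3
        return value
    return evaluate

def boolean_to_gf3(f, n):
    """Sum one minterm polynomial for each input n-tuple on which f is 1."""
    minterms = [minterm_poly(p) for p in product((0, 1), repeat=n) if f(p)]
    return lambda x: sum(m(x) for m in minterms) % 3

# The polynomial x1 * (1 - x2) * x3 from the text is 1 only at (1, 0, 1).
m = minterm_poly((1, 0, 1))
minterm_table = {x: m(x) for x in product((0, 1), repeat=3)}

# Parity on three variables realized as a sum of its minterm polynomials.
parity = lambda x: sum(x) % 2
parity_poly = boolean_to_gf3(parity, 3)
```

On Boolean inputs exactly one minterm polynomial is nonzero, which is why the sum stays in $\{0, 1\}$.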

Although we establish the lower bound for the basis containing NOT and the unbounded
fan-in AND and OR gates, the result continues to hold if the unbounded fan-in MOD3 function
is added to the basis. (See Problem 9.41.) We begin by showing that the function computed
by a circuit $C$ containing size($C$) gates cannot differ from its $b$-approximator on too many
input tuples.

LEMMA 9.7.6 Let $f : B^n \to B$ be computed by a circuit $C$ of depth $d$. There is a $(2k)^d$-
approximator circuit $\hat{C}$ computing $\hat{f} : B^n \to B$ such that $f$ and $\hat{f}$ differ on at most
size($C$)$\,2^{n-k}$ input $n$-tuples, where $n$ is the number of inputs on which $C$ depends and
size($C$) is the number of gates that it contains.

Proof We construct a $b$-approximator for $C$, $b = (2k)^d$, by approximating inputs ($x_i$ and
$\bar{x}_i$ are approximated exactly on $B$ by $x_i$ and $(1 - x_i)$), after which we approximate gates all
of whose inputs have been approximated until the output gate has been approximated. We
establish the result of the lemma by induction.

We treat the statement of the lemma as our inductive hypothesis and show that if it holds
for $d = D - 1$, it holds for $d = D$. The hypothesis holds on inputs, namely, when $d = 0$.
Suppose the hypothesis holds for $d = D - 1$. Since $C$ has depth $D$, each of the inputs to the
output gate has depth at most $D - 1$ and satisfies the hypothesis. The output gate is AND,
OR, or NOT. Suppose it is NOT. Let $g$ be the function associated with its input. We replace
the NOT gate with the function $(1 - g)$, which introduces no new errors. Since $g$ and $1 - g$
have the same degree, the inductive hypothesis holds in this case.

If the output gate is the AND of $g_1, g_2, \ldots, g_m$, it can be represented exactly by the
function $g_1 g_2 \cdots g_m$. However, this polynomial has degree $m(2k)^{D-1}$ if each of its inputs
has degree at most $(2k)^{D-1}$; this violates the inductive hypothesis if $m > 2k$, which may
happen because the fan-in of the gate is potentially unbounded. Thus we must introduce
some error in order to reduce the degree of the approximating polynomial. Since the OR of
$g_1, g_2, \ldots, g_m$ can be represented by $1 - (1 - g_1)(1 - g_2) \cdots (1 - g_m)$ using DeMorgan's
Rules, both AND and OR of $g_1, g_2, \ldots, g_m$ have the same degree. We find an approximating
polynomial for both AND and OR by approximating the OR gate.

We approximate the OR of $g_1, g_2, \ldots, g_m$ by creating subsets $S_1, S_2, \ldots, S_k$ of
$\{g_1, g_2, \ldots, g_m\}$, computing $f_i = (\sum_{g_j \in S_i} g_j)^2$, and combining these results in

$$\mathrm{OR}(f_1, f_2, \ldots, f_k) = 1 - (1 - f_1)(1 - f_2) \cdots (1 - f_k)$$

The degree of this approximation is $2k$ times the maximal degree of any polynomial in the
set $\{g_1, g_2, \ldots, g_m\}$ or at most $(2k)^D$, the desired result.

There is no error in this approximation if the original OR has value 0. We now show
that there exist subsets $S_1, S_2, \ldots, S_k$ such that the error is at most $2^{n-k}$ when the original
OR has value 1. Let's fix on a particular input $n$-tuple $x$ to the circuit. Suppose each subset
is formed by deciding for each function in $\{g_1, g_2, \ldots, g_m\}$ with probability 1/2 whether
or not to include it in the set. If one or more of $\{g_1, g_2, \ldots, g_m\}$ is 1 on $x$, the probability
that a given $f_i$ has value 1 (that is, that the sum over $S_i$ is nonzero in GF(3)) is at least 1/2.
Thus, the probability that $\mathrm{OR}(f_1, f_2, \ldots, f_k)$ has value 0 when the original OR has value 1
is the probability that each of $f_1, f_2, \ldots, f_k$ has value 0, which is at most $2^{-k}$. Since the sets
$\{S_1, S_2, \ldots, S_k\}$ result in an error on input $x$ with probability at most $2^{-k}$, the average
number of errors on input $x$, averaged over all choices for the $k$ sets, is at most $2^{-k}$ and the
average number of errors on the set of $2^n$ inputs is at most $2^{n-k}$. It follows that some set
$\{S_1, S_2, \ldots, S_k\}$ (and a corresponding approximating function) has an incorrect value on
at most $2^{n-k}$ inputs. Since by the inductive hypothesis at most (size($C$) $- 1)2^{n-k}$ errors
occur on all but the output gate, at most size($C$)$\,2^{n-k}$ errors occur on the entire circuit. ■
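
The random-subset step in this proof can be explored numerically. The sketch below is illustrative only and makes simplifying assumptions: the functions $g_1, \ldots, g_m$ are taken to be the Boolean input values themselves, arithmetic modulo 3 stands in for GF(3), and all function names (`approx_or`, `random_subsets`, `count_errors`, `nonzero_fraction`) are hypothetical.

```python
import random
from itertools import product

def approx_or(values, subsets):
    """Evaluate 1 - prod_i (1 - f_i) modulo 3, where
    f_i = (sum of values[j] for j in subsets[i])^2."""
    result = 1
    for s in subsets:
        f_i = (sum(values[j] for j in s) ** 2) % 3   # a square mod 3 is 0 or 1
        result = (result * (1 - f_i)) % 3
    return (1 - result) % 3

def random_subsets(m, k, rng):
    """Choose k subsets of range(m), including each index with probability 1/2."""
    return [[j for j in range(m) if rng.random() < 0.5] for _ in range(k)]

def count_errors(m, subsets):
    """Number of Boolean m-tuples on which the approximation differs from OR."""
    return sum(approx_or(v, subsets) != int(any(v))
               for v in product((0, 1), repeat=m))

def nonzero_fraction(values):
    """Fraction of all subsets whose sum of selected values is nonzero mod 3."""
    m = len(values)
    hits = sum(sum(v for v, keep in zip(values, mask) if keep) % 3 != 0
               for mask in product((0, 1), repeat=m))
    return hits / 2 ** m

rng = random.Random(0)      # fixed seed, chosen arbitrarily
m, k = 8, 4
errors = count_errors(m, random_subsets(m, k, rng))
```

`nonzero_fraction` enumerates every subset for one fixed input and so checks the "at least 1/2" claim exhaustively for small $m$, while `count_errors` reports the error count of one particular random choice, which the averaging argument bounds only in expectation.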

The next result demonstrates that a $\sqrt{n}$-approximator (obtained by letting $k = n^{1/(2d)}/2$)
and the parity function must differ on many inputs. This is used to show that the circuit being
approximated must have many gates.

LEMMA 9.7.7 Let $\hat{f} : B^n \to B$ be a $\sqrt{n}$-approximator for $f_\oplus^{(n)}$. Then, $\hat{f}$ and $f_\oplus^{(n)}$ differ on at
least $2^n/50$ input $n$-tuples.

Proof Let $U \subseteq B^n$ be the $n$-tuples on which the functions agree. We derive an upper
bound on $|U|$ of $\beta = (49)2^n/50$ that implies the lower bound of the lemma. We derive
this bound indirectly. Since there are $3^{|U|}$ functions $g : U \to \{-1, 0, 1\}$, assign each one
a different polynomial and show that the number of such polynomials is at most $3^\beta$, which
implies that $|U| \le \beta$.

Transform the polynomial in the variables $x_1, x_2, \ldots, x_n$ representing $\hat{f}$ by mapping
$x_i$ to $y_i = 2x_i - 1$. This mapping sends 1 to 1 and 0 to $-1$. (Observe that $y_i^2 = 1$.) It
does not change the degree of a polynomial. In these new variables $f_\oplus^{(n)}$ can be represented
exactly by the polynomial $y_1 y_2 \cdots y_n$.
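
As a sanity check on this change of variables, the short sketch below compares the product $y_1 y_2 \cdots y_n$ with parity after parity is itself mapped through $v \mapsto 2v - 1$. The sign factor $(-1)^{n+1}$ used here is our own bookkeeping and is not stated in the text; the function names are hypothetical.

```python
from itertools import product

def to_gf3(v):
    """Map an integer to the representative of its residue in {-1, 0, 1}."""
    r = v % 3
    return r - 3 if r == 2 else r

def y_product(x):
    """Product y_1 y_2 ... y_n with y_i = 2 x_i - 1, reduced into {-1, 0, 1}."""
    prod = 1
    for xi in x:
        prod *= 2 * xi - 1
    return to_gf3(prod)

def parity_in_y(x):
    """Parity of x in {0, 1}, pushed through the same map v -> 2v - 1."""
    return to_gf3(2 * (sum(x) % 2) - 1)

def product_matches_parity(n):
    """True if y_1...y_n equals (-1)^(n+1) times the mapped parity on all of B^n."""
    sign = (-1) ** (n + 1)
    return all(y_product(x) == sign * parity_in_y(x)
               for x in product((0, 1), repeat=n))

agree = product_matches_parity(5)
```

So, under these assumptions, the product of the $y_i$ determines parity up to a fixed sign that depends only on $n$, which is enough for the degree argument that follows.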

Given a function $g : U \to \{-1, 0, 1\}$, extend it arbitrarily to a function $\tilde{g} : B^n \to
\{-1, 0, 1\}$. Let $p$ be a polynomial in $Y = \{y_1, y_2, \ldots, y_n\}$ that represents $\tilde{g}$ on $U$ exactly.
Let $c\,y_{i_1} y_{i_2} \cdots y_{i_t}$ be a term in $p$ for some constant $c \in \{-1, 1\}$. We show that if $t$ is larger
than $n/2$ we can replace this term with a smaller-degree term.

Let $T = \{y_{i_1}, y_{i_2}, \ldots, y_{i_t}\}$ and $\bar{T} = Y - T$. The term $c\,y_{i_1} y_{i_2} \cdots y_{i_t}$ can be written
as $c\,\Pi T$, where by $\Pi T$ we mean the product of all terms in $T$. With $y_i^2 = 1$, this may
be rewritten as $c\,\Pi Y\,\Pi \bar{T}$. Since $f_\oplus^{(n)} = \Pi Y$, on the set $U$ this is equivalent to $c\,\hat{f}\,\Pi \bar{T}$,
which has degree $\sqrt{n} + n - |T|$. Thus, a term $c\,y_{i_1} y_{i_2} \cdots y_{i_t}$ of degree $t > n/2$ can
be replaced by a term of degree $\sqrt{n} + n - t$. It follows that the number of polynomials
(and functions) representing functions whose values coincide with $f_\oplus^{(n)}$ on $U$ is the number
of polynomials of degree at most $\sqrt{n} + n/2$. Since there are $\binom{n}{j}$ ways to choose a term
containing $j$ variables of $Y$, there are at most $N$ terms from which to form polynomials
representing functions $g : U \to \{-1, 0, 1\}$, where $N$ satisfies the following bound:

$$N \le \sum_{j=0}^{\sqrt{n} + (n/2)} \binom{n}{j}$$

For sufficiently large $n$, the bound to $N$ is approximately $0.9772 \cdot 2^n < (49/50)2^n$. (See
Problem 9.7.) Since each of the $N$ terms can be included in a polynomial with coefficient
$-1$, 0, or 1, there are at most $3^N$ distinct polynomials and corresponding functions
$g : U \to \{-1, 0, 1\}$, which is the desired conclusion. ■
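
The constant 0.9772 can be checked numerically. The sketch below is an illustration under our own conventions: `tail_fraction` is a hypothetical helper, $n$ is taken to be even, and $\lfloor\sqrt{n}\rfloor$ is used for the square root in the upper limit.

```python
from math import isqrt

def tail_fraction(n):
    """Return sum_{j=0}^{n/2 + sqrt(n)} C(n, j) / 2^n for even n,
    using floor(sqrt(n)) as the integer square root."""
    upper = n // 2 + isqrt(n)
    term, total = 1, 0              # term holds C(n, j), starting at j = 0
    for j in range(upper + 1):
        total += term
        term = term * (n - j) // (j + 1)   # C(n, j+1) from C(n, j), exact
    return total / 2 ** n

for n in (100, 400, 1600, 10000):
    print(n, round(tail_fraction(n), 4))
```

The printed fractions should settle near 0.977 as $n$ grows, which is the value of the standard normal distribution function two standard deviations above the mean (the standard deviation of the underlying binomial distribution is $\sqrt{n}/2$).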

We summarize these two results in Theorem 9.7.4. 

THEOREM 9.7.4 Every circuit of depth $d$ for the parity function $f_\oplus^{(n)}$ has a size exceeding
$2^{n^{1/(2d)}/2}/50$ for sufficiently large $n$.

Proof Let $U$ be the set of $n$-tuples on which $f_\oplus^{(n)}$ and its approximation $\hat{f}$ differ. From
Lemma 9.7.6, $|U|$ is at most size($C$)$\,2^{n-k}$. Now let $k = n^{1/(2d)}/2$. From Lemma 9.7.7 these
two functions must differ on at least $\frac{1}{50}2^n$ input $n$-tuples. Thus, size($C$)$\,2^{n-k} \ge \frac{1}{50}2^n$, from
which the conclusion follows. ■



Problems 

MATHEMATICAL PRELIMINARIES 

9.1 Show that the following identity holds for integers $r$ and $L$:

$$\left\lceil \frac{L}{r+1} \right\rceil + \left\lfloor \frac{rL}{r+1} \right\rfloor = L$$



9.2 Show that a rooted tree of maximal fan-in $r$ containing $k$ internal vertices has at most
$k(r - 1) + 1$ leaves and that a rooted tree with $l$ leaves and fan-in $r$ has at most $l - 1$
vertices with fan-in 2 or more and at most $2(l - 1)$ edges.

9.3 For positive integers $n_1$, $n_2$, $a_1$, and $a_2$, show that the following inequality holds:

$$\frac{n_1^2}{a_1} + \frac{n_2^2}{a_2} \ge \frac{(n_1 + n_2)^2}{(a_1 + a_2)}$$

9.4 The external path length $e(T, L)$ of a binary tree $T$ with $L$ leaves is the sum of the
lengths of the paths from the root to the leaves. Show that
$e(T, L) \ge L\lceil \log_2 L \rceil - 2^{\lceil \log_2 L \rceil} + L$.

Hint: Argue that the external path length is minimal for a nearly balanced binary tree.
Use this fact and a proof by induction to obtain the external path length of a binary
tree with $L = 2^k$ for some integer $k$. Use this result to establish the above statement.

9.5 For positive integers $r$ and $s$, show that $\lceil s/r \rceil (s \bmod r) + \lfloor s/r \rfloor (r - s \bmod r) = s$.
Hint: Use the fact that for any real number $a$, $\lceil a \rceil - \lfloor a \rfloor = 1$ if $a$ is not an integer and
0 otherwise. Also use the fact that $s \bmod r = s - \lfloor s/r \rfloor \cdot r$.
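
Before attempting a proof of Problem 9.5, the identity can be spot-checked by brute force. This sketch assumes the identity reads $\lceil s/r \rceil (s \bmod r) + \lfloor s/r \rfloor (r - s \bmod r) = s$; the function name `balanced_split_sum` is hypothetical.

```python
def balanced_split_sum(s, r):
    """Left-hand side: ceil(s/r) * (s mod r) + floor(s/r) * (r - s mod r)."""
    floor_q = s // r
    ceil_q = -(-s // r)          # integer ceiling without floating point
    rem = s % r
    return ceil_q * rem + floor_q * (r - rem)

# Exhaustive check over a small range of positive integers.
holds = all(balanced_split_sum(s, r) == s
            for s in range(1, 200) for r in range(1, 50))
```

One way to read the identity: $s$ items split into $r$ groups as evenly as possible give $s \bmod r$ groups of size $\lceil s/r \rceil$ and the remaining groups of size $\lfloor s/r \rfloor$. A numerical check is of course not a proof.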

9.6 (Binomial Theorem) Show that the coefficient of the term $x^i y^{n-i}$ in the expansion of
the polynomial $(x + y)^n$ is the binomial coefficient $\binom{n}{i}$. That is,

$$(x + y)^n = \sum_{i=0}^{n} \binom{n}{i} x^i y^{n-i}$$

9.7 Show that the following sum is closely approximated by $0.4772 \cdot 2^n$ for large $n$:

$$\sum_{j=(n/2)}^{(n/2) + \sqrt{n}} \binom{n}{j}$$

Hint: Use the fact that $n!$ can be very closely approximated by $\sqrt{2\pi n}\,n^n e^{-n}$ to
approximate $\binom{n}{j}$. Then approximate a sum by an integral (see Problem 2.23) and consult
tables of values for the error function $\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2}\,dt$.

9.8 Let $0 \le x \le y$. Show that $x + \sqrt{y - x} \ge \sqrt{y}$.

CIRCUIT MODELS AND MEASURES 

9.9 Provide an algorithm that produces a formula for each circuit of fan-out 1 over a basis 
that has fan-in of at most 2. 

9.10 Show that any monotone Boolean function $f^{(n)} : B^n \to B$ can be expanded on its
first variable as

$$f(x_1, x_2, \ldots, x_n) = f(0, x_2, \ldots, x_n) \vee (x_1 \wedge f(1, x_2, \ldots, x_n))$$
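
The expansion in Problem 9.10 can be tested exhaustively on small examples before proving it. The sketch below is illustrative: threshold functions are used as sample monotone functions, parity as a non-monotone counterexample, and the names `threshold` and `expansion_holds` are hypothetical.

```python
from itertools import product

def threshold(t):
    """Monotone threshold function: 1 if at least t inputs are 1."""
    return lambda x: int(sum(x) >= t)

def expansion_holds(f, n):
    """Check f(x) == f(0, x2..xn) OR (x1 AND f(1, x2..xn)) on all of B^n."""
    for x in product((0, 1), repeat=n):
        rest = x[1:]
        rhs = f((0,) + rest) | (x[0] & f((1,) + rest))
        if f(x) != rhs:
            return False
    return True

n = 5
monotone_ok = all(expansion_holds(threshold(t), n) for t in range(n + 2))
parity_ok = expansion_holds(lambda x: sum(x) % 2, n)   # parity is not monotone
```

The expansion fails for parity because $f(0, x_2, \ldots, x_n) \le f(1, x_2, \ldots, x_n)$ need not hold without monotonicity.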

9.11 Show that a circuit for a Boolean function (one output vertex) over the standard basis
can be transformed into one that uses negation only on inputs by at most doubling the 
number of AND, OR, and NOT gates and without changing its depth by more than a 
constant factor. 

Hint: Find the two-input gate closest to the output gate that is connected to a NOT 
gate. Change the circuit to move the NOT gate closer to the inputs. 

RELATIONSHIPS AMONG COMPLEXITY MEASURES 

9.12 Using the construction employed in Theorem 9.2.1, show that the depth of a function
$f : B^n \to B^m$ in a circuit of fan-out $s$ over a complete basis $\Omega$ of fan-in $r$ satisfies the
inequality

$$D_{s,\Omega}(f) \le D_\Omega(f)\,\big(1 + l(\Omega) + l(\Omega) \log_s (r\,C_{s,\Omega}(f)/D_\Omega(f))\big)$$




9.13 Show that there are ten functions / with L^i(f) = 2 that are dependent on two 
variables and that each can be realized from a circuit for / mux plus at most one instance 
of NOT on an input to /mux and on its output. 

9.14 Extend the upper bound on depth versus formula size of Theorem 9.2.2 to monotone 
functions. 



LOWER-BOUND METHODS FOR GENERAL CIRCUITS 

9.15 Show that the function $f(x_1, x_2, \ldots, x_n) = x_1 \wedge x_2 \wedge \cdots \wedge x_n$ has circuit size
$\lceil (n - 1)/(r - 1) \rceil$ and depth $\lceil \log_r n \rceil$ over the basis containing the $r$-input AND gate.

9.16 The parity function $f_\oplus^{(n)} : B^n \to B$ has value 1 when an odd number of its variables
have value 1 and 0 otherwise. Derive matching upper and lower bounds on the size
and depth of the smallest and shallowest circuit(s) for $f_\oplus^{(n)}$ over the basis $B_2$.

9.17 Show that the function $f_{\mathrm{mod}\,4}^{(n)}$ defined to have value 1 if the sum of the $n$ inputs
modulo 4 is 1 can be realized by a circuit over the basis $B_2$ whose size is $2.5n + O(1)$.
Hint: Show that the function is symmetric and devise a circuit to compute the sum of
three bits as the sum of two bits.

9.18 Over the basis $B_2$ derive good upper and lower bounds on the circuit size of the
functions $f_4^{(n)} : B^n \to B$ and $f_5^{(n)} : B^n \to B$ defined as

$$f_4^{(n)} = ((y + 2) \bmod 4) \bmod 2$$
$$f_5^{(n)} = ((y + 2) \bmod 5) \bmod 2$$

Here $y = \sum_{i=1}^{n} x_i$ and $\sum$ and $+$ denote integer addition.

9.19 Show that the set of Boolean functions on two variables that depend on both variables
contains only AND-type and parity-type functions. Here an AND-type function
computes $(x^a \wedge y^b)^c$ for Boolean constants $a$, $b$, $c$ whereas a parity-type function computes
$x \oplus y \oplus c$ for some Boolean constant $c$.

9.20 The threshold function $\tau_t^{(n)} : B^n \to B$ on $n$ inputs has value 1 if $t$ or more inputs
are 1 and 0 otherwise. Show that over the basis $B_2$, $C_{B_2}(\tau_2^{(n)}) \ge 2n - 4$.

9.21 A formula for the parity function $f_{\oplus,c}^{(n)} : B^n \to B$ on $n$ inputs is given below. Show
that it has circuit size exactly $3(n - 1)$ over the standard basis when NOT gates are not
counted:

$$f_{\oplus,c}^{(n)} = x_1 \oplus x_2 \oplus \cdots \oplus x_n \oplus c$$

9.22 Show that $f_{\oplus,c}^{(n)}$ has circuit size exactly $4(n - 1)$ over the standard basis when NOT gates
are counted.

9.23 Show that $f_{\oplus,c}^{(n)}$ has circuit size exactly $7(n - 1)$ over the basis $\{\wedge, \neg\}$.

LOWER BOUNDS TO FORMULA SIZE

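9.24 Show that the multiplexer function $f_{\mathrm{mux}}^{(p)}$ can be realized by a formula of size $3 \cdot 2^p - 2$
in which the total number of address variables is $2(2^p - 1)$.

Hint: Expand the function $f_{\mathrm{mux}}^{(p)}$ as suggested below, where $a^{(k)}$ denotes the $k$
components of $a$ with smallest index and $P = 2^p$:

$$f_{\mathrm{mux}}^{(p)}(a^{(p)}, y_{P-1}, \ldots, y_0) = f_{\mathrm{mux}}^{(1)}\big(a_{p-1},\ f_{\mathrm{mux}}^{(p-1)}(a^{(p-1)}, y_{P-1}, \ldots, y_{P/2}),\ f_{\mathrm{mux}}^{(p-1)}(a^{(p-1)}, y_{P/2-1}, \ldots, y_0)\big)$$

Also, represent $f_{\mathrm{mux}}^{(1)}$ as shown below.

$$f_{\mathrm{mux}}^{(1)}(a, y_1, y_0) = (\bar{a} \wedge y_0) \vee (a \wedge y_1)$$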
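
The recursion in the hint to Problem 9.24 can be simulated directly. The sketch below rests on two assumptions of ours: that the expansion selects the upper half of the data inputs when the top address bit is 1, and that formula size is counted as the number of leaves (variable occurrences). All function names are hypothetical.

```python
from itertools import product

def mux1(a, y1, y0):
    """One-bit multiplexer: (NOT a AND y0) OR (a AND y1)."""
    return ((1 - a) & y0) | (a & y1)

def mux(address, data):
    """Recursive multiplexer. address[0] is the least significant bit and
    data[j] is y_j, so len(data) == 2 ** len(address)."""
    if len(address) == 1:
        return mux1(address[0], data[1], data[0])
    half = len(data) // 2
    low_bits, top = address[:-1], address[-1]
    return mux1(top, mux(low_bits, data[half:]), mux(low_bits, data[:half]))

def formula_size(p):
    """Leaf count of the formula produced by the recursion."""
    return 4 if p == 1 else 2 + 2 * formula_size(p - 1)

def address_leaves(p):
    """Occurrences of address variables in that formula."""
    return 2 if p == 1 else 2 + 2 * address_leaves(p - 1)
```

Each level of the recursion adds two occurrences of one address bit and duplicates the smaller formula, which is where the recurrences in `formula_size` and `address_leaves` come from.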

9.25 Show that Neciporuk's method cannot provide a lower bound larger than $O(n^2/\log n)$
for a function on $n$ variables.

9.26 Derive a quadratic upper bound on the formula size of the parity function $f_\oplus^{(n)}$ over
the standard basis.

9.27 Neciporuk's function is defined in terms of an $\lceil n/m \rceil \times m$ matrix of Boolean variables,
$X = \{x_{i,j}\}$, $m = \lceil \log_2 n \rceil + 2$, and a matrix $S = \{\sigma_{i,j}\}$ of the same dimensions
in which each entry $\sigma_{i,j}$ is a distinct $m$-tuple over $B$ containing at least two 1s.
Neciporuk's function, $N(X)$, is defined as



*w=0*uA0 n 



Xk,l 



i,j k - 1 l such that 

Here $\oplus$ denotes the exclusive or operation. Show that this function has formula size
$\Omega(n^2/\log n)$ over the basis $B_2$.

9.28 Use Krapchenko's method to derive a lower bound of $n^2$ on the formula size of the
parity function $f_\oplus^{(n)} : B^n \to B$.

9.29 Use Krapchenko's method to derive a lower bound of $\Omega(t(n - t + 1))$ on the formula
size over the standard basis of the threshold function $\tau_t^{(n)}$, $1 \le t \le n - 1$.

9.30 Generalize Krapchenko's lower-bound method as follows. Let / : B n 1— ► £? and let 
-4 C /-'(O) and P C / -1 (l). Let Q = [q itj ] be defined by q id =_1 if x* G ^ and 
Xj € P are neighbors and (ftj = otherwise. Let P = QQ and P = Q Q. Then 
p r]S is the number of common neighbors to X r and a; s in B. The matrices P and 
P are symmetric and their largest eigenvalues, X(P) and A(P), are both non-negative 
andA(P) = A(P). Show that 

M/) > A(P) 

9.31 Under the conditions of Problem 9.30, let

where $K(f)$ is the lower bound given in Theorem 9.4.2. Show that

K(f) < D(f) < A(P) 
K(f) < D(f) < X(P) 

Hint: Use the fact that the largest eigenvalue of a matrix P satisfies 

x T Px 



\{P) = max 

x^O X 1 X 

Also, let Si be the sum of the elements in the ith column of the matrix Q. Show that 

l^ii S i = Z-/r,sP r ^ m 



LOWER-BOUND METHODS FOR MONOTONE CIRCUITS 

9.32 Consider a monotone circuit on $n$ inputs that computes a monotone Boolean function
$f : B^n \to B$. Let the circuit have $k$ two-input AND gates, one of them the output gate,
and let these gates compute the Boolean functions $g_1, g_2, \ldots, g_k = f$, where the AND
gates are inverse-ordered by their distance from the output gate computing $f$. Since the
function $g_j$ is computed using the values of $x_1, x_2, \ldots, x_n, g_1, \ldots, g_{j-1}$, show that $g_j$
can be computed using at most $n + j - 2$ two-input OR gates and one AND gate. Show
that this implies the following upper bound on the monotone circuit size of $f$:

Ca mon (f)<kn+( k ~ 

Let $C_\wedge(f)$ denote the minimum number of AND gates used to realize $f$ over the monotone
basis. This result implies the following relationship:

$$C_{\Omega_{mon}}(f) = O\big((C_\wedge(f))^2\big)$$

How does this result change if the gate associated with / is an OR gate? 

9.33 Show that the prime implicants of a monotone function are monotone prime impli- 
cants. 

9.34 Find the monotone implicants of the Boolean threshold function $\tau_t^{(n)} : B^n \to B$,
$1 \le t \le n$.

9.35 Using the gate-elimination method, show that $C_{\Omega_{mon}}(\tau_2^{(n)}) \ge 2n - 3$.

9.36 Show that an expansion of the form of equation (9.1) on page 420 holds for every 
monotone function. 

9.37 Show that the f^' k : B n ^ n ~ '' 2 i— > B can be realized by a monotone circuit of size 
0(n n ). 



9.38 Show that the largest value assumed by $\min(\sqrt{k - 1}/2,\ n/(2k))$ under variation of $k$
is $\Omega(n^{1/3})$.

CIRCUIT DEPTH

9.39 Show that the communication complexity of a problem $(U, V)$, $U, V \subseteq B^n$, satisfies
$C(U, V) \le n + \log_2^* n$, where $\log_2^* n$ is the number of times that $\lceil \log_2 \rceil$ must be taken
to reduce $n$ to zero.

Hint: Complete the definition of a protocol in which Player I sends Player II
$n - \lceil \log_2 n \rceil$ bits on the first round and Player II responds with a message specifying
whether or not its $n$-tuple agrees with that of Player I and if not, where they differ.

9.40 Consider the communication problem defined by the following sets:

$U = \{u \mid 3 \text{ divides the number of 1s in } u\}$

$V = \{v \mid 3 \text{ does not divide the number of 1s in } v\}$

Show that a protocol exists that solves this problem with communication complexity
$3\lceil \log_2 n \rceil$.
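
One protocol that meets the bound in Problem 9.40 is a halving search; the sketch below simulates it and is offered as one possible approach, not as the intended solution. It assumes the goal is for the players to agree on an index at which $u$ and $v$ differ, that Player I sends the count of 1s in the left half modulo 3 (two bits), and that Player II answers with one bit. The names `find_differing_position` and `bit_budget` are hypothetical.

```python
from math import ceil, log2

def bit_budget(n):
    """The target bound 3 * ceil(log2 n) on the number of bits exchanged."""
    return 3 * ceil(log2(n))

def find_differing_position(u, v):
    """Simulate a halving protocol for u with ones-count divisible by 3 and
    v with ones-count not divisible by 3. Returns (index, bits_exchanged)."""
    lo, hi = 0, len(u)           # invariant: counts on [lo, hi) differ mod 3
    bits = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        left_u = sum(u[lo:mid]) % 3      # Player I sends this value (2 bits)
        left_v = sum(v[lo:mid]) % 3
        differ = left_u != left_v        # Player II replies with 1 bit
        bits += 3
        if differ:
            hi = mid                     # the left halves already disagree
        else:
            lo = mid                     # so the right halves must disagree
    return lo, bits
```

The invariant is that the numbers of 1s in $u$ and $v$ restricted to the current interval differ modulo 3; when the interval has length one this forces $u_i \ne v_i$.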

9.41 Show that Theorem 9.7.4 continues to hold when the MOD3 function is added to the 
basis where MOD3 is the Boolean function that has value 1 when the number of Is 
among its inputs is not divisible by 3. 

Chapter Notes 

The dependence of circuit size on fan-out stated in Theorem 9.2.1 is due to Johnson et al. 
[150]. The depth bound implied by this result is proportional to the product of the depth and 
the logarithm of the size of the original circuit. Hoover et al. [138] have improved the depth 
bound so that it is proportional to $(\log_r s)\,D_\Omega(f)$ without sacrificing the size bound of [150].

The relationship between formula size and depth in Theorem 9.2.2 is due to Spira [314], 
whose depth bound has a coefficient of proportionality of 2.465 over the basis of all Boolean 
functions on two variables. Over the basis of all Boolean functions except for parity and its 
complement, Preparata and Muller [259] obtain a coefficient of 1.81. Brent, in a paper on the 
parallelization of arithmetic formulas [58], has effectively extended the relationship between 
depth and formula size to monotone functions. (See also [359].) 

An interesting relationship between complexity measures that is omitted from Section 9.2, 
due to Paterson and Valiant [240], shows that circuit size and depth satisfy the inequality 

D a (f) > \c a (f)]ogC a (f)-0(C a (f)) 

The lower bounds of Theorem 9.3.2 on functions in $Q_{2,3}^{(n)}$ are due to Schnorr [300],
whereas that of Theorem 9.3.3 on the multiplexer function is due to Paul [244]. Blum [48],
building on the work of Schnorr [302], has obtained a lower bound of $3(n - 1)$ for a particular
function of $n$ variables over the basis $B_2$. This is the best circuit-size lower bound for this
basis. Zwick [374] has obtained a lower bound of $4n$ for certain symmetric functions over the
basis $U_2$. Red'kin [274] has obtained lower bounds with coefficients as high as 7 for certain
functions over the bases $\{\wedge, \neg\}$ and $\{\vee, \neg\}$. (See Problem 9.23.) Red'kin [276] has used the
gate-elimination method to show that the size of the ripple-adder circuit of Section 2.7 cannot
be improved.




The coefficient of Neciporuk's lower-bound method [230] in Theorem 9.4.1 has been
improved upon by Paterson (unpublished) and Zwick [373]. Paul [244] has applied Neciporuk's
method to show that the indirect storage access function has formula size $\Omega(n^2/\log n)$ over
the basis $B_2$. Neciporuk's method has also been applied to many other problems, including the
determinant [169], the marriage problem [126], recognition of context-free languages [241],
and the clique function [304].

The proof of Krapchenko's lower bound [174] given in Theorem 9.4.2 is due to Paterson,
as described by Bopanna and Sipser [50]. Koutsoupias [172] has obtained the results of
Problems 9.30 and 9.31, improving upon the Krapchenko lower bounds for the $k$th threshold
function by a factor of at least 2. Andreev [24], building on the work of Subbotovskaya
[320], has improved upon Krapchenko's method and exhibits a lower bound of $\Omega(n^{2.5-\epsilon})$ on
a function of $n$ variables for every fixed $\epsilon > 0$ when $n$ is sufficiently large. Krichevskii [176]
has shown that over the standard basis, $\tau_t$ requires formula size $\Omega(n \log n)$, which beats
Krapchenko's lower bound for small and large values of $t$.

Symmetric functions are examined in Section 2.11 and upper bounds are given on the
circuit size of such functions over the basis $\{\wedge, \vee, \oplus\}$. Polynomial-size formulas for symmetric
functions are implicit in the work of Ofman [234] and Wallace [356], who also independently
demonstrated how to add two binary numbers in logarithmic depth. Krapchenko [175]
demonstrated that all symmetric Boolean functions have formula size 0(n ) over the stan- 
dard basis. Peterson [247], improving upon the results of Pippenger [248] and Paterson [241], 
showed that all symmetric functions have formula size 0(n 3 7 ) over the basis B 2 . Paterson, 
Pippenger, and Zwick [242,243] have recently improved these results, showing that over B 2 
and U 2 formulas exist of size 0(n ) and 0(n ), respectively, for many symmetric Boolean 
functions including the majority function, and of size O(n ii0 ) and 0(n . ), respectively, for 
all symmetric Boolean functions. 

Markov demonstrated that the minimal number of negations needed to realize an arbitrary
binary function on $n$ variables with an arbitrary number of output variables, maximized over
all such functions, is at most $\lceil \log_2(n+1) \rceil$. For Boolean functions (they have one output
variable) it is at most $\lfloor \log_2(n+1) \rfloor$. Fischer [100] has described a circuit whose size is at most
twice that of an optimal circuit plus the size of a circuit that computes $f_{\mathrm{neg}}(x_1, \ldots, x_n) =
(\overline{x}_1, \ldots, \overline{x}_n)$ and whose depth is at most that of the optimal circuit plus the depth of a circuit
for $f_{\mathrm{neg}}$. He exhibits a circuit for $f_{\mathrm{neg}}$ of size $O(n^2 \log n)$ and depth $O(\log n)$. This is
the result given in Theorem 9.5.1. Tanaka and Nishino [323] have improved the size bound
on $f_{\mathrm{neg}}$ to $O(n \log^2 n)$ at the expense of increasing the depth bound to $O(\log^2 n)$. Beals,
Nishino, and Tanaka [32] have further improved these results, deriving simultaneous size and
depth bounds of $O(n \log n)$ and $O(\log n)$, respectively.

Using non-constructive methods, a series of upper bounds have been developed on the 
monotone formula size of the threshold functions r t by Valiant [346] and Bopanna [49], 
culminating in bounds by Khasin [166] and Friedman [106] ofO(r nlogn) over the mono- 
tone basis. With constructive methods, Ajtai, Komlos, and Szemeredi [14] obtained polyno- 
mial bounds on the formula size r t over the monotone basis. Using their construction, Fried- 
man [106] has obtained a bound on formula size over the monotone basis of 0(t c n log n) for 
c a large constant. 

Over the basis $B_2$, Fischer, Meyer, and Paterson [101] have shown that the majority
function $\tau_t$, $t = \lceil n/2 \rceil$, and other symmetric functions require formula size $\Omega(n \log n)$. Pudlak
[264], building on the work of Hodes and Specker [136], has shown that all but 16 symmetric
Boolean functions on $n$ variables require formula size $\Omega(n \log \log n)$ over the same basis. The
16 exceptional functions have linear formula size.

Using counting arguments such as those given in Section 2.12, Gilbert [114] has shown
that most monotone Boolean functions on $n$ variables have a circuit size that is $\Omega(2^n/n^{3/2})$.
Red'kin [275] has shown that the lower bound can be achieved to within a constant
multiplicative factor by every monotone Boolean function.

Tiekenheinrich [330] gave a $4n$ lower bound on the monotone circuit size of a simple
function. Dunne [87] derived a $3.5n$ lower bound on the monotone circuit size for the
majority function.

The lower bound on the monotone circuit size of binary sorting (Theorem 9.6.1) is due
to Lamagna and Savage [188] using an argument patterned after that of Van Voorhis [351] for
comparator-based sorting networks. Muller and Preparata [225,226] demonstrate that binary
sorting over the standard basis has circuit size $O(n)$. (See Theorem 2.11.1.) Pippenger and
Valiant [253] and Lamagna [187] demonstrate an $\Omega(n \log n)$ lower bound on the monotone
circuit size of merging. These results are established in Section 9.6.1. The sorting network
designed by Ajtai, Komlos, and Szemeredi [14] when specialized to Boolean data yields a
monotone circuit of size $O(n \log n)$ for binary sorting.

The first proof that the monotone circuit size of $n \times n$ Boolean matrix multiplication
(see Section 9.6.2) is $\Omega(n^3)$ was obtained by Pratt [256]. Later Paterson [238] and Mehlhorn
and Galil [218] demonstrated that it is exactly $n^2(2n - 1)$. Weiss [361] discovered a simple
application of the function-replacement method to both Boolean convolution and Boolean
matrix multiplication, as summarized in Corollary 9.6.1 and Theorem 9.6.5. (Wegener [360,
p. 170] extended Weiss's result to include the number of ORs.) Wegener [357] has exhibited an
$n$-input, $n$-output Boolean function (Boolean direct product) whose monotone circuit size is
$\Omega(n^2)$. Earlier several authors examined the class of multi-output functions known as Boolean
sums in which each output is the OR of a subset of inputs. Neciporuk [231] gave an explicit
set of Boolean sums and demonstrated that its monotone circuit size is $\Omega(n^{3/2})$. This lower
bound for such functions was independently improved to $\Omega(n^{5/3})$ by Mehlhorn [216] and
Pippenger [250]. More recently, Andreev [23] has constructed a family of Boolean sums with
monotone circuit size that is $\Omega(n^{2-\epsilon})$ for every fixed $\epsilon > 0$.

The first super-polynomial lower bound on the monotone circuit size of the clique function 
was established by Razborov [270]. Shortly afterward, Andreev [22], using similar methods, 
gave an exponential lower bound on the monotone circuit size of a problem in NP. Because the 
clique function is complete with respect to monotone projections [310,344], this established 
an exponential lower bound for the clique function. Alon and Boppana [17], by strengthening
Razborov's method, gave a direct proof of this fact, giving a lower bound exponential in
$\Omega((n/\log n)^{1/3})$. The stronger lower bound given in Theorem 9.6.6, which is exponential
in $\Omega(n^{1/3})$, is due to Amano and Maruoka [20]. They apply bottleneck counting, an idea of
Haken [125], to establish this result. Amano and Maruoka [20] have also extended the
approximation method to circuits that have negations only on their inputs and for which the number
of inputs carrying negations is small. They show that, even with a small number of negations, 
an exponential lower bound on the circuit size of the clique function can be obtained. 

Having shown that monotone circuit complexity can lead to exponential lower bounds, 
Razborov [271] then cast doubt on the likelihood that this approach would lead to exponential 
non-monotone circuit size bounds by proving that the matching problem on bipartite graphs, 
a problem in P, has a super-polynomial monotone circuit size. Tardos [324] strengthened
Razborov's lower bound, deriving an exponential one. Later Razborov [273] demonstrated
that the obvious generalization of the approximation method cannot yield better lower bounds
than $\Omega(n^2)$ for Boolean functions on $n$ inputs realized by circuits over complete bases.

Berkowitz [37] introduced the concept of pseudo-inverse and established Theorem 9.6.9. 
Valiant [347], Wegener [358], and Paterson (unpublished — see [92,360]) independently im- 
proved upon the size of the monotone circuit realizing all pseudo-negations from 0(n 2 log n) 
to 0(nlog n) to produce Theorem 9.6.8. Lemma 9.6.9 is due to Dunne [90]. 

In his Ph.D. thesis Dunne [88] has given the most general definition of pseudo-negation.
He shows that a Boolean function $h$ is a pseudo-negation on variable $x_i$ of a Boolean function
$f$ on the $n$ variables $x_1, \ldots, x_n$ if and only if $h$ satisfies

$$f(x)|_{x_i=0} \leq h(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n) \leq f(x)|_{x_i=1}$$

Here $f(x)|_{x_i=a}$ denotes the function obtained from $f$ by fixing $x_i$ at $a$.

Dunne [89] demonstrated that HALF-CLIQUE CENTRAL SLICE is NP-complete (The- 
orem 9.6.10) and showed that the central slices of the HAMILTONIAN CIRCUIT (there is a 
closed path containing each vertex once) and SATISFIABILITY are NP-complete. As men- 
tioned by Dunne [91], not all NP-complete problems have NP-complete central slices. 

The concept of communication complexity arose in the context of the VLSI model of 
computation discussed in Chapter 12. In this case it measures the amount of information that 
must be transmitted from the inputs to the outputs of a function. The communication game 
described in Section 9.7.1 is different: it characterizes a search problem because its goal is to 
find an input variable on which two n-tuples in disjoint sets disagree. 

Yao [366] developed a method to derive lower bounds on the communication complexity
of functions $f : X \times Y \mapsto Z$. He considered the matrix of values of $f$ where the rows
and columns are indexed by the values of $X$ and $Y$. He defined monochromatic rectangles
as submatrices in which all entries are the same. He then established that the logarithm of
the minimal number of disjoint monochromatic rectangles in this matrix is a lower bound on
the number of bits that must be exchanged to compute $f$. (This result shows, for example, that
the identity function $f : \mathcal{B}^{2n} \mapsto \mathcal{B}$ defined by $f(x, y) = 1$ if and only if $x_i = y_i$ for all
$1 \leq i \leq n$ requires the exchange of at least $n + 1$ bits.) Savage [288] adapted the crossing sequence
argument from one-tape Turing machines (an application of the pigeonhole principle) to derive
lower bounds on predicates. Mehlhorn and Schmidt [220] show that functions $f : X \times Y \mapsto Z$
for which $Z$ is a subset of a field have a communication complexity that is at most the rank
of the two-dimensional matrix of values of $f$.

The development of the relationship between the circuit depth of a function and its
communication complexity follows that given by Karchmer and Wigderson [157]. Karchmer [156]
cites Yannakakis for independently discovering the connection $D_{\Omega}(f) = C(f^{-1}(0), f^{-1}(1))$
of Theorem 9.7.1 for non-monotone functions. Karchmer and Wigderson [157] have examined
$st$-connectivity in this framework. This is the problem of determining from the adjacency
matrix of an undirected graph $G$ with $n$ vertices and two distinguished vertices, $s$ and
$t$, whether there is a path from $s$ to $t$. When characterized as a Boolean function on the edge
variables, this is a monotone function. Karchmer and Wigderson [157] have shown that the
circuit depth of this function is $\Omega((\log n)^2/\log\log n)$, a result later improved to $\Omega((\log n)^2)$
independently by Hastad and Boppana in unpublished work. Raz and Wigderson [269] have
shown via a complex proof that the clique problem on $n$-vertex graphs studied in Section 9.7.4
has monotone communication complexity and depth $\Omega(n)$. The simpler but weaker lower
bound for this problem developed in Section 9.7.4 is due to Goldmann and Hastad [116].

Furst, Saxe, and Sipser [107] and, independently, Ajtai [13] obtained the first strong lower
bounds on the size of bounded-depth circuits. They demonstrated that every bounded-depth
circuit for the parity function $f_{\oplus}$ has superpolynomial size. Using a deeper analysis, Yao
[368] demonstrated that bounded-depth circuits for $f_{\oplus}$ have exponential size. Hastad [124]
strengthened the results and simplified the argument, giving a lower bound on circuit size of
$2^{\Omega(n^{1/d}/10)}$ for circuits of depth $d$.

Razborov [272] examined a more powerful class of bounded-depth circuits, namely, circuits
that use unbounded fan-in AND, OR, and parity functions. He demonstrated that the
majority function $\tau_{n/2}$ has exponential size over this larger basis. Smolensky [313] simplified
and strengthened Razborov's result, obtaining an exponential lower bound on the size of a
bounded-depth circuit for the $\mathrm{MOD}_p$ function over the basis AND, OR, and $\mathrm{MOD}_q$ when $p$
and $q$ are distinct powers of primes. We use a simplified version of his result in Section 9.7.5.

CHAPTER 10



Space-Time Tradeoffs 



An important question in the study of computation is how best to use the registers of a CPU 
and/or the random-access memory of a general-purpose computer. In most computations, the 
number of registers (space) available is insufficient to hold all the data on which a program 
operates and registers must be reused. If the space is increased, the number of computation 
steps (time) can generally be reduced. This is an example of a space-versus-time tradeoff. In 
this chapter we examine tradeoffs between the number of storage locations and computation 
time using the pebble game and the branching program model. 

The pebble game assumes that computations are done with straight-line programs in a 
data-independent fashion. Each such program is modeled by a directed acyclic graph. A 
pebble on a vertex indicates that its value is in a register. The goal of the game is to pebble the 
output vertices of the graph with numbers of pebbles (space) and steps (time) that are minimal, 
that is, neither can be reduced without increasing the other. 

A branching program models data-dependent computation under the assumption that in- 
put variables assume a bounded number of values. Such a program is defined by a directed 
acyclic multigraph (there may be more than one edge between vertices) that specifies the order 
in which inputs are read. Time is the length of the longest path in a multigraph and space is 
the logarithm of its number of vertices. 

For both models we present techniques to derive lower bounds on the exchange of space $S$
for time $T$. For most problems examined here these exchanges are of the form $ST = \Omega(n^2)$,
where $n$ is the size of the problem input. Upper bounds on $ST$ are obtained by evaluating $S$
and $T$ for particular algorithms.

Because the branching program is more general than the pebble game, it is more difficult 
to obtain good lower bounds with it, and for this reason we begin with the pebble game. In 
addition, the pebble game is appropriate for problems such as integer multiplication, convo- 
lution, and matrix multiplication on which only straight-line programs are used. For other 
problems, such as merging and sorting, the algorithms used typically involve branching and 
for them the branching program is the better model. 

We also exhibit extreme results for the pebble game by showing that the time to pebble 
some graphs goes from minimal to exponential in the size of the graphs when the number 
of pebbles changes by 1, a warning against trying too hard to minimize the number of CPU 
registers used in a computation.

10.1 The Pebble Game

The pebble game is a game played on directed acyclic graphs (DAGs), which capture the 
dependencies of straight-line programs studied in Chapters 2 and 6. Algorithms for many 
important problems, such as the FFT and matrix multiplication, are naturally computed by 
straight-line programs. In the pebble game pebbles are placed on vertices of a DAG to indicate 
that the value associated with a vertex resides in a register. Pebbles are placed on vertices in a 
data-independent order. 

In this game a pebble can be placed on an input vertex at any time and on any non-input 
vertex whose immediate predecessor vertices carry pebbles. The goal of the game is to place 
pebbles on each output vertex. A pebble can be removed from a vertex, including an output 
vertex, at any time after it has been pebbled. These rules are summarized below. 

The rules of the pebble game are the following: 

• (Initialization) A pebble can be placed on an input vertex at any time. 

• (Computation Step) A pebble can be placed on (or moved to) any non-input vertex only 
if all its immediate predecessors carry pebbles. 

• (Pebble Deletion) A pebble can be removed at any time. 

• (Goal) Each output vertex must be pebbled at least once. 

Placement of a pebble on an input vertex models the reading of input data. Placement of 
a pebble on a non-input vertex corresponds to computing the value associated with the vertex. 
The removal of a pebble models the erasure or overwriting of the value associated with the 
vertex on which the pebble resides. 

Allowing pebbles to be placed on input vertices at any time reflects the assumption that 
inputs are readily available. (The multi-level pebble game introduced in the next chapter 
models the case in which each access to secondary storage is expensive.) The condition that 
all predecessor vertices carry pebbles when a pebble is placed on a vertex models the natural 
requirement that an operation can be performed only after all arguments of the operation 
are located in main memory. Moving (or sliding) a pebble to a vertex from an immediate 
predecessor reflects the design of CPUs that allow the result of a computation to be placed in 
a memory location holding an operand. 

A pebbling strategy is the execution of the rules of the pebble game on the vertices of a 
graph. We assign a step to each placement of a pebble, ignoring steps on which pebbles are 
removed, and number the steps consecutively from 1 to T, the time or number of steps in 
the strategy. The space, S, used by a pebbling strategy is the maximum number of pebbles 
it uses. The goal of the pebble game is to pebble a graph with values of space and time that 
are minimal; that is, the space cannot be reduced for the given value of time and vice versa. 
In general, it is not possible to minimize space and time simultaneously. We derive upper and 
lower bounds on the possible exchanges of space for time. 
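
The rules and the definitions of space and time given above can be checked mechanically. The
following is a minimal sketch, assuming the graph is given as a dictionary of predecessor lists;
the function name run_pebbling and the encoding of moves as tuples are illustrative choices made
here, not notation from the text.

```python
def run_pebbling(preds, outputs, moves):
    """Replay a pebbling strategy and return (space S, time T).

    preds   -- dict: vertex -> list of immediate predecessors ([] for inputs)
    outputs -- iterable of output vertices
    moves   -- list of ("place", v), ("slide", u, v) or ("remove", v)
    """
    pebbled = set()        # vertices currently carrying a pebble
    seen_outputs = set()   # outputs that have been pebbled at least once
    space = time = 0
    for move in moves:
        kind = move[0]
        if kind == "remove":
            pebbled.remove(move[1])   # KeyError if there is no pebble to remove
            continue                  # removals are not counted as steps
        v = move[-1]
        # Computation step: every immediate predecessor must carry a pebble.
        # The test is vacuous for inputs, which may be pebbled at any time.
        if not all(p in pebbled for p in preds[v]):
            raise ValueError(f"illegal move {move}: unpebbled predecessor")
        if kind == "slide":
            u = move[1]
            if u not in preds[v]:
                raise ValueError(f"illegal slide {move}")
            pebbled.remove(u)         # the pebble leaves predecessor u ...
        pebbled.add(v)                # ... and lands on v
        time += 1
        space = max(space, len(pebbled))
        if v in outputs:
            seen_outputs.add(v)
    if seen_outputs != set(outputs):
        raise ValueError("goal not met: some output was never pebbled")
    return space, time


# T(1): two inputs a and b feeding a root r.
preds = {"a": [], "b": [], "r": ["a", "b"]}
with_slide = [("place", "a"), ("place", "b"), ("slide", "a", "r")]
without_slide = [("place", "a"), ("place", "b"), ("place", "r")]
print(run_pebbling(preds, ["r"], with_slide))
print(run_pebbling(preds, ["r"], without_slide))
```

Both strategies take three steps, but sliding the pebble from a to r uses two pebbles where a
fresh placement on r uses three. This is why the rules allow a pebble to be moved to a vertex
from an immediate predecessor.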

10.1.1 The Pebble Game Versus the Branching Program

As stated above, the branching program model introduced in Section 10.9 handles data- 
dependent computation, and is thus a more general model than the pebble game. However, 
there are three reasons to study the pebble game. First, the branching program assumes that
input variables are held in an auxiliary random-access machine so that it can access them in
arbitrary order, a condition not imposed on pebble games. It follows that inputs to a pebble
game can be fetched in advance, since the times at which they are needed are data-independent.
Second, lower bounds on the exchange of space for time with branching programs are harder to
obtain due to their increased flexibility. Third, straight-line programs are used in many
problems, such as integer multiplication, convolution, matrix multiplication, and the discrete Fourier
transform, and the pebble game gives the relevant lower bounds. For other problems, such as
sorting and merging, the branching program model is the model of choice since these problems
are typically solved with branching programs. We expand upon this topic in Section 10.9.1.

Figure 10.1 An FFT graph $F^{(3)}$ on $n = 8$ inputs. Input vertices are on the bottom; edges are
directed upward. Four pebbles are shown on the graph when pebbling the leftmost output.

10.1.2 Playing the Pebble Game 

The pebble game is illustrated in Fig. 10.1 by pebbling the FFT graph $F^{(3)}$ with eight inputs
and 24 non-input vertices. This graph has the property that the set of paths from input vertices 
to an output vertex forms a complete balanced binary tree. (See Fig. 10.2.) It follows that we 
can pebble the FFT graph by pebbling each of the trees. Since two of the eight outputs share 
the same tree at the next lower level, we can pebble two outputs at the same time. 

Binary trees form an important class of graphs. A complete balanced binary tree of depth 
4 is illustrated in Fig. 10.2. (The depth of a directed tree is the number of edges on the longest 
path from an input vertex to the output (or root) vertex.) This tree has 16 input vertices and 
one output vertex. A complete balanced binary tree of depth 0, T(0), consists of a single 
vertex. A complete balanced binary tree of depth d > 0, T(d), consists of a root vertex and 
two copies of T(d — 1) whose root vertices each have one edge directed from them to the 
root vertex of the full tree. Thus in Fig. 10.2 the complete balanced binary tree of depth four 
T(4) is constructed of two copies of T(3), which in turn are each constructed of two copies of 
T(2), and so on. It follows by straightforward induction that a complete balanced binary tree 
of depth $d$ has $2^d$ inputs and $2^{d+1} - 1$ vertices. (See Problem 10.8.)

Figure 10.2 A complete balanced binary tree T(4) of depth 4 on 16 inputs. At least five
pebbles are needed to pebble it. 
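
The induction behind the counts of inputs and vertices of T(d) can be written out as follows.
This is a short sketch in which I(d) and V(d) are names introduced here for the number of inputs
and the number of vertices of T(d); they do not appear in the text.

```latex
\begin{align*}
I(0) &= 1, & I(d) &= 2\,I(d-1) \;\Longrightarrow\; I(d) = 2^{d},\\
V(0) &= 1, & V(d) &= 2\,V(d-1) + 1 = 2\,(2^{d} - 1) + 1 = 2^{d+1} - 1 .
\end{align*}
```

For d = 4 this gives 16 inputs and 31 vertices, in agreement with the 16 inputs of T(4).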



The binary tree of Fig. 10.2 can be pebbled with five pebbles by pebbling the vertices in 
the order shown. Five pebbles are needed at the time when vertex 27 is pebbled. After one 
pebble is moved to vertex 30, the two outputs of the FFT of Fig. 10.1 to which vertices 15 and 
30 are attached can be pebbled. This tree-pebbling strategy can be repeated on all remaining 
outputs. It is a general strategy for pebbling complete balanced binary trees. 

This pebbling strategy, explained in detail in the next section, demonstrates that an FFT
graph on $n = 2^k$ inputs can be pebbled with no more pebbles than are needed to pebble the
trees with $n$ leaves contained within it, namely, $k + 1$. In the next section we show that this
is the minimum number of pebbles needed to pebble a complete balanced binary tree on $2^k$
leaves. This FFT pebbling strategy for the graph in Fig. 10.1 pebbles each vertex on the third
and fourth levels once, each vertex on the second level twice, and each vertex on the first level
four times. It is clear that inputs must be repebbled if the minimum number of pebbles is used.
This is an example of a space-time tradeoff. We shall derive a lower bound on the exchange of
space for time for this problem.

In the next section we also examine the minimum space required to pebble graphs. In the 
subsequent section we describe a graph that exhibits an extreme tradeoff. This graph requires 
a pebbling time exponential in the size of the graph when the minimum number of pebbles is 
used but can be pebbled with one move per vertex if one more pebble is available. 

After studying extreme tradeoffs we define a flow property of functions that, if satisfied, 
implies a lower bound on the product $(S + 1)T$ (or a related expression) involving the space S
and time T needed to compute such functions. This test is used to show that many standard 
algorithms are optimal with respect to their use of space and time. 



10.2 Space Lower Bounds 



In this section we derive lower bounds on the minimum space $S_{\min}(G)$ needed to pebble a
graph $G$ for balanced binary trees, pyramids, and FFT graphs, a representative set of graphs.
Any pebbling strategy will need to use at least as many pebbles as this minimum value of space.
It can be shown that no bounded-degree graph on $n$ vertices requires more than $O(n/\log n)$
space (see Theorem 10.7.1) and that some graph requires space proportional to $n/\log n$ (see
Theorem 10.8.1).

Complete balanced binary trees were introduced in the previous section. We now derive a 
lower bound on the space (number of pebbles) needed to pebble them. 

LEMMA 10.2.1 Any pebbling strategy for the complete balanced binary tree of depth $k$, $T(k)$,
requires at least $S_{\min}(T(k)) = k + 1$ pebbles and $2^{k+1} - 1$ steps. There is a pebbling strategy of
$T(k)$ that uses exactly this many pebbles and steps.

Proof Proof of the lemma requires a proof that k + 1 pebbles are necessary as well as a 
strategy that pebbles the tree with k + 1 pebbles and makes one pebble placement per 
vertex. Let's first develop a pebbling strategy. 

$T(0)$ obviously can be pebbled with one pebble in one step. Assume that $T(k-1)$ can
be pebbled with $k$ pebbles in $2^k - 1$ steps. To pebble $T(k)$, advance a pebble to the root of
its left subtree (a copy of $T(k-1)$) using $k$ pebbles and $2^k - 1$ steps. Leave a pebble on its
root. Then pebble the right subtree of $T(k)$ using $k$ pebbles and $2^k - 1$ steps. (A snapshot
of $T(k)$ when the number of pebbles is maximal under this pebbling strategy is shown in
Fig. 10.2.) Finally, move a pebble from one of the two subtree roots to the root of $T(k)$ in one
more step. Thus, $T(k)$ is pebbled in $2 \times (2^k - 1) + 1 = 2^{k+1} - 1$ steps with $k + 1$ pebbles.

The lower bound is derived by showing that no pebbling strategy can use fewer than 
k + 1 pebbles. The argument used is the following: initially no path to the root of the tree 
(or output) from input vertices carries a pebble because there are no pebbles on the graph. 
At the end of the computation a pebble resides on the root and all paths to the root carry 
pebbles. Therefore, there must be a first point in time at which there is a pebble on each 
path to the root. This must be a time at which a pebble is placed on an input vertex, thereby 
closing the last path from that input to the root. Such a path is highlighted in Fig. 10.2. 
Before a pebble is placed on the input vertex of this path, all other paths from input vertices 
to the root carry pebbles. Each of these paths enters the highlighted path via one edge. Thus, 
it follows that prior to the placement of this last pebble there is at least one pebble on the 
tree for each of the k edges on this path except for the input vertex. Consequently, at least 
k + 1 pebbles are on the tree when the last pebble is placed on it. ■ 
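
The recursive strategy in the first half of the proof can be simulated by counting pebbles and
placements rather than building the tree explicitly. The sketch below is illustrative only: the
function name pebble_tree and its bookkeeping are assumptions made here, and the final move onto
each root is treated as a slide from one of its two children.

```python
def pebble_tree(k):
    """Pebble T(k) with the recursive strategy; return (peak pebbles, steps)."""
    state = {"on": 0, "peak": 0, "steps": 0}

    def place():
        # Put a new pebble on the tree and record the step.
        state["on"] += 1
        state["steps"] += 1
        state["peak"] = max(state["peak"], state["on"])

    def visit(depth):
        # On return, exactly one more pebble is on the tree: on this subtree's root.
        if depth == 0:
            place()              # pebble an input vertex
            return
        visit(depth - 1)         # left subtree; a pebble stays on its root
        visit(depth - 1)         # right subtree, with one pebble already tied up
        # Slide a pebble from one child to the root (one step), then remove
        # the pebble on the other child (removals are not counted as steps).
        state["steps"] += 1
        state["on"] -= 1

    visit(k)
    return state["peak"], state["steps"]


for k in range(6):
    print(k, pebble_tree(k))
```

For each depth tried, the pair returned equals (k + 1, 2^(k+1) - 1), matching the upper bound in
Lemma 10.2.1.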

The FFT graph on $2^k$ inputs, $F^{(k)}$, is defined recursively in terms of two sub-FFT graphs
$F^{(k-1)}$ as shown in Section 6.7.2. It follows that this graph contains many copies of the tree
$T(k)$ as a subgraph (see Problem 10.11) and that any pebbling strategy for $F^{(k)}$ requires at
least $k + 1$ pebbles. Many other straight-line computations involve tree computations.

A pyramid graph on $m$ inputs, $P(m)$ ($P(6)$ is shown in Fig. 10.3), is obtained by slicing
an $m \times m$ mesh into two parts along its diagonal, splitting all diagonal nodes (which are now
inputs), and then directing edges from the diagonal vertices in one part to the one remaining
unsplit corner vertex in this part of the graph. Edges are directed up, a convention we use
throughout this chapter. $P(m)$ has $n = m(m + 1)/2$ vertices. (See Problem 10.1.)

We apply to the pyramid graph P(m) the lower bounding argument used in the preceding 
proof based on closing the last open path to the output vertex. 

LEMMA 10.2.2 Any pebbling strategy for the $m$-input, $n$-vertex ($n = m(m + 1)/2$) pyramid
graph $P(m)$ requires at least $m$ pebbles; that is, a minimum space $S_{\min}(P(m)) = m \geq \sqrt{2n} - 1$.
There exists a pebbling strategy that pebbles $P(m)$ with $m$ pebbles using one pebble placement
per vertex.

Figure 10.3 The pyramid graph on six inputs.

Proof The lower-bound proof again uses the fact that there is a first time at which all paths
from an input to the output carry pebbles. Highlighted in Fig. 10.3 is a last path to carry
a pebble. Prior to the placement of this last pebble, all other paths to the output carry pebbles.
Thus, with the placement of the last pebble there must be at least as many pebbles on the
pyramid graph as there are vertices on a path from an input to the output, namely, $m$, and
$m \geq \sqrt{2n} - 1$. (See Problem 10.1.)

With m pebbles, the vertices can be pebbled in levels by first placing pebbles on each of 
the m inputs. Pebbles are then advanced to vertices on the second level from left to right, 
and this process is repeated at all levels to complete the pebbling. Each vertex is pebbled 
once with this strategy. ■ 
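
The level-by-level strategy in the second half of the proof can be replayed as follows. This is
a sketch under an indexing assumption made here, not taken from the text: vertex (l, j) on level
l has immediate predecessors (l - 1, j) and (l - 1, j + 1), and level 0 holds the m inputs. The
function name pebble_pyramid is likewise illustrative.

```python
def pebble_pyramid(m):
    """Pebble P(m) level by level; return (peak pebbles, placements, vertices)."""
    pebbled = {(0, j) for j in range(m)}    # place a pebble on each input
    steps, peak = m, m
    for level in range(1, m):
        for j in range(m - level):          # left to right across the level
            left, right = (level - 1, j), (level - 1, j + 1)
            # Rule check: both immediate predecessors carry pebbles.
            assert left in pebbled and right in pebbled
            pebbled.remove(left)            # slide the pebble from `left` ...
            pebbled.add((level, j))         # ... up to the new vertex
            steps += 1
            peak = max(peak, len(pebbled))
        # The rightmost pebble on the previous level is no longer needed.
        pebbled.remove((level - 1, m - level))
    n = m * (m + 1) // 2
    return peak, steps, n


for m in range(1, 7):
    print(m, pebble_pyramid(m))
```

In each case the peak number of pebbles is m and the number of placements equals the number of
vertices n = m(m + 1)/2, so every vertex is pebbled exactly once.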

In general, it is very hard to determine the minimum number of pebbles needed to pebble 
a graph. In terms of the complexity classes introduced in Chapter 8, we model this problem as 
a language consisting of strings each of which contains the description of a graph G = (V, E), 
a vertex $v \in V$, and an integer S with the property that the vertex can be pebbled with S or
fewer pebbles. The language of these strings is PSPACE-complete (see Section 8.12). 

10.3 Extreme Tradeoffs 

We now show that extreme space-time tradeoff behavior is possible. We do this by exhibiting a 
family of graphs, $H_1, H_2, \ldots, H_k, \ldots$ (Fig. 10.4), that requires a number of steps exponential
in the size of the graph when the minimum number of pebbles is used but only one step per 
vertex when one more pebble is available. This illustrates that excessive minimization of the 
number of registers used by programs can be harmful! 

$H_1$ has one input and one output vertex and an edge connecting them, as shown in
Fig. 10.4. For $k \geq 2$ the $k$th graph, $H_k$, has $k + 1$ output vertices and is constructed from
one copy of $H_{k-1}$, a tree (on the left) with $k$ inputs, a two-level bipartite graph (on the top
right) with $k$ inputs and $k + 1$ outputs, and a chain of $k$ vertices that connects the tree to the
outputs of $H_{k-1}$ and the open vertex. (A bipartite graph is a graph in which the vertices are
partitioned into two sets and edges join vertices in different sets.)

We summarize our pebbling results for this family of graphs below. Here $n!$ is the factorial
function with value $n! = n \cdot (n-1) \cdot (n-2) \cdots 2 \cdot 1$.

Figure 10.4 A family of graphs exhibiting an extreme tradeoff. (The panels are labeled $H_1$,
$H_2$, and $H_k$.)



THEOREM 10.3.1 The graph $H_k$ has $N(k) = 2k^2 + 5k - 6$ vertices for $k \geq 2$. Any pebbling
strategy for the graph $H_k$ requires at least $k$ pebbles, $k = \Theta(\sqrt{N(k)})$. Any strategy to pebble $H_k$
with $k$ pebbles requires at least $(k+1)!/2 = 2^{\Omega(\sqrt{N(k)} \log N(k))}$ steps, whereas there exists a
pebbling algorithm using $k + 1$ pebbles that pebbles each vertex of $H_k$ once.

Proof Consider a pebbling strategy that uses $k + 1$ pebbles to pebble $H_k$. For the case of
$k = 1$, $H_k$ can be completely pebbled with one move per vertex. This is also true for $H_2$
because we can move a pebble to the open vertex connected to the bipartite graph using two
pebbles, from which we can advance two of our three pebbles to the bottom layer of the
bipartite graph and have one additional pebble with which to pebble the output vertices.
Note that this pebbling strategy allows us to pebble output vertices of $H_2$ from left to right
with three pebbles.

Assume that we can pebble the outputs of $H_{k-1}$ from left to right with $k$ pebbles without
pebbling any vertex more than once. Then to pebble $H_k$, advance a pebble to the root of
the tree on the left and then pebble the outputs of $H_{k-1}$ from left to right using $k$ pebbles
while keeping one additional pebble on the chain. Advance this pebble along the chain until
it reaches the open vertex. At this point $k$ pebbles can be advanced to the bottom row of
vertices in the bipartite graph and the remaining pebble used to pebble outputs from left to
right. This establishes the inductive step.

The minimum number of pebbles needed to pebble $H_k$ is at least $k$ because at least this
many are needed to pebble the tree on the left. To show that this value can be achieved, we
give a recursive pebbling strategy. Observe that $H_1$ can be pebbled with $k = 1$ pebbles. To
pebble $H_k$, assume that we can pebble any one output of $H_{k-1}$ with $k - 1$ pebbles. Advance
a pebble to the root of the left tree and then advance it along the chain by pebbling output
vertices of $H_{k-1}$ from left to right with $k - 1$ pebbles. Move a pebble to the open vertex
and then to all vertices on one side of the bipartite graph. Any one output vertex can now
be pebbled. However, doing so requires that one vertex on the bottom side of the bipartite
graph lose its pebble. Thus, no other output vertex can be pebbled without repebbling the
tree and all vertices of $H_{k-1}$.

As this pebbling strategy demonstrates, to pebble an output vertex, all $k$ pebbles must
move to the bottom of the bipartite graph, thereby removing all pebbles from other vertices
of $H_k$. Let $M(k)$ be the number of pebble placements to pebble $H_k$ with $k$ pebbles. It
follows that to pebble each of the $(k + 1)$ outputs of $H_k$ with $k$ pebbles, we must pebble
each output of $H_{k-1}$ with $k - 1$ pebbles. Thus,

$$M(k) \geq (k+1) \times M(k-1) \geq (k+1)\,k\,(k-1) \cdots 3 \cdot 1 = (k+1)!/2$$

which provides the desired lower bound.

Let the graph $H_k$ have $N(k)$ vertices. Then $N(1) = 2$, $N(2) = 12$ and $N(k) =
N(k-1) + 4k + 3$ for $k \geq 3$. A straightforward proof by induction shows that $N(k) =
2k^2 + 5k - 6$ for $k \geq 2$ (see Problem 10.13).

To show that $M(k) \geq (k+1)!/2$ is exponential in $N(k) = 2k^2 + 5k - 6$, note that
$p! = p \cdot (p-1) \cdots 3 \cdot 2 \cdot 1$, which is at least $(p/2)^{p/2}$ since each of the
first $p/2$ terms is at least $p/2$. Thus, $M(k) \geq .5[(k+1)/2]^{(k+1)/2}$. Also, it is easy to
see that $N(k) \leq 3(k+1)^2$ for $k \geq 1$. Since this implies $\sqrt{N(k)/3} \leq (k+1)$, we
have that

$$M(k) \geq .5\left[\sqrt{N(k)/3}\,/\,2\right]^{\left(\sqrt{N(k)/3}\right)/2}$$

which is exponential in $N(k)$. ■
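The chain of inequalities above can be checked numerically for small k. The sketch below is only an illustration: the function names are hypothetical, the closed form for N(k) is used for k >= 2 only, and floating-point arithmetic is assumed to be accurate enough at these sizes.

```python
from math import factorial, sqrt

def placements_lower_bound(k):
    """(k+1)!/2, the lower bound on M(k) derived above."""
    return factorial(k + 1) // 2

def num_vertices(k):
    """Closed form N(k) = 2k^2 + 5k - 6 (used here for k >= 2)."""
    return 2 * k * k + 5 * k - 6

def chain_holds(k):
    """Check M(k) >= .5[(k+1)/2]^((k+1)/2) >= .5[q/2]^(q/2), q = sqrt(N(k)/3),
    together with N(k) <= 3(k+1)^2."""
    p = k + 1
    q = sqrt(num_vertices(k) / 3)
    middle = 0.5 * (p / 2) ** (p / 2)
    right = 0.5 * (q / 2) ** (q / 2)
    return (placements_lower_bound(k) >= middle >= right
            and num_vertices(k) <= 3 * p * p)

for k in range(2, 9):
    print(k, num_vertices(k), placements_lower_bound(k), chain_holds(k))
```

The printed rows show how quickly the bound on pebble placements grows relative to the number of vertices.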



Many vertices in the graph $H_k$ have fan-in $k$. A new family $\{G_k\}$ of graphs with fan-in
2 can be obtained by replacing the tree on the left in $H_k$ with the pyramid graph of Fig. 10.3
and replacing the bipartite graph on the top with a new graph (see Problem 10.14). This new
graph exhibits an exponential jump in the time to pebble the graph but at a value of space that
is the fourth root of the number of vertices in $G_k$.



10.4 Grigoriev's Lower-Bound Method 



In this section we present a method for developing lower bounds on the exchange of space for
time in the pebble game. These lower bounds are typically of the form $(S + 1)T = \Omega(n^2)$,
where $S$, $T$, and $n$ are the space, time, and the size of the input to the problem, and are similar
in spirit to those of Theorem 3.6.1. Because they assume a less general model of computation
(the pebble game instead of the RAM), lower bounds are easier to derive.

The lower bounds use as a measure the maximum amount of information that can flow 
from a subset of the inputs to a subset of the outputs, and are much easier to derive than are 
lower bounds on circuit size for the circuit model. Although the results are stated for straight- 
line computations, they apply to all "input-output-oblivious" computations by finite-state ma- 
chines: computations in which inputs are read and outputs produced at times independent of 
the values of the input variables. (See Problem 10.20.) 

10.4.1 Flow Properties of Functions

We start by defining a flow property of functions. (See Fig. 10.5.) A function $f : A^n \mapsto A^m$
has a large information flow from input variables in $X_1$ to output variables in $Y_1$ if there are
values for input variables in $X_0 = X - X_1$ such that many different values can be assumed by
outputs in $Y_1$ as inputs in $X_1$ range over all their $|A|^{|X_1|}$ values. This flow property is also used
in Section 12.7 to derive lower bounds on the exchange of area for time in the VLSI model of
computation.

Figure 10.5 A function $f$ that has a large information flow from input variables in $X_1$ to
output variables in $Y_1$ for some values of input variables in $X_0 = X - X_1$.

DEFINITION 10.4.1 A function $f : A^n \mapsto A^m$ has a $w(u, v)$-flow if for all subsets $X_1$ and
$Y_1$ of its $n$ input and $m$ output variables, with $|X_1| \geq u$ and $|Y_1| \geq v$, there is a subfunction
$h$ of $f$ obtained by making some assignment to variables of $f$ not in $X_1$ (variables in $X_0$) and
discarding output variables not in $Y_1$ such that $h$ has at least $|A|^{w(u,v)}$ points in the image of its
domain.

The exponent function $w(u, v)$ is a nondecreasing function of both of its arguments:
increasing $u$, the number of variables that are allowed to vary, can only increase the number of
values assumed by $f$; the same is true if $v$ is increased.

An important class of functions are the (a, n, m,p)-independent functions defined below. 

DEFINITION 10.4.2 A function $f : A^n \mapsto A^m$ is an $(a, n, m, p)$-independent function for
$a \geq 1$ and $p \leq m$ if it has a $w(u, v)$-flow satisfying $w(u, v) > (v/a) - 1$ for $n - u + v \leq p$.

We illustrate the independence property of a function with matrix multiplication: we show
that the function defined by the product of two $n \times n$ matrices is $(1, 2n^2, n^2, n)$-independent.
In Section 10.5.4, we show that a stronger property holds for matrix multiplication.

The proof of the independence property of $n \times n$ matrix multiplication uses the permutation
matrices described in Section 6.2. An $n \times n$ permutation matrix is obtained by permuting either the
rows or columns of the $n \times n$ identity matrix. When a permutation matrix $B$ multiplies another
matrix $A$ on the right (left) to produce $AB$ ($BA$), it permutes the columns (rows) of $A$.

LEMMA 10.4.1 The matrix multiplication function $f^{(n)}_{A \times B} : \mathcal{R}^{2n^2} \mapsto \mathcal{R}^{n^2}$ over the ring $\mathcal{R}$ is
$(1, 2n^2, n^2, n)$-independent.

Proof Let $C = AB$ be the product of $n \times n$ matrices $A$ and $B$. Consider any set $X_0$ of
input variables (entries of $A$ and $B$) and any set $Y_1$ of output variables (entries of $C$) such
that $|X_0| + |Y_1| = n$. The outputs in $Y_1$ fall into at most $|Y_1|$ columns of $C$ and the inputs
in $X_0$ fall into at most $|X_0|$ columns of $A$. It follows that at least $n - |X_0|$ columns of $A$
contain only variables in $X_1$. Fix the entries in $B$ so that it forms a permutation matrix that
permutes the columns of $A$ containing only elements in $X_1$ onto columns of $C$ containing
elements of $Y_1$. (We are free to make the best assignment of variables in $B$, whether in $X_0$
or $X_1$.) It follows that each output variable in $Y_1$ is assigned to an input variable of $A$ in $X_1$
by this permutation. Thus these output variables are free to assume $|\mathcal{R}|^{|Y_1|}$ different values.
Since this is more than $|\mathcal{R}|^{|Y_1| - 1}$, it follows that $f^{(n)}_{A \times B}$ is $(1, 2n^2, n^2, n)$-independent. ■

As this result illustrates, for any set $Y_1$ of outputs of the matrix multiplication function and
any set $X_0$ of its inputs satisfying $|X_0| + |Y_1| \leq p$, there is some assignment to these inputs such
that there is a large flow of information from the complementary set of inputs, $X_1$, to any set
$Y_1$ of its outputs.
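The permutation argument in the proof of Lemma 10.4.1 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not part of the original text: the function names are hypothetical, and the sets X_0 and Y_1 are summarized only by the columns of A and C that they touch.

```python
def permutation_for_flow(n, x0_a_cols, y1_cols):
    """Build an n x n permutation matrix B such that C = A B copies columns of A
    that contain no fixed (X0) variable onto the columns of C that hold selected
    (Y1) outputs. Returns (B, source), where source[j] is the column of A that
    lands in column j of C."""
    free_cols = [j for j in range(n) if j not in x0_a_cols]
    if len(free_cols) < len(y1_cols):
        raise ValueError("need (number of X0 columns) + (number of Y1 columns) <= n")
    # Route free columns of A to the selected columns of C first.
    source = dict(zip(sorted(y1_cols), free_cols))
    # Complete the assignment to a full permutation in any order.
    used = set(source.values())
    rest_a = [j for j in range(n) if j not in used]
    rest_c = [j for j in range(n) if j not in source]
    source.update(zip(rest_c, rest_a))
    # (A B)[i][j] = sum_k A[i][k] * B[k][j], so B[source[j]][j] = 1 copies a column.
    B = [[1 if source[j] == k else 0 for j in range(n)] for k in range(n)]
    return B, source

def matmul(A, B):
    """Plain n x n matrix product over the integers."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

n = 4
A = [[10 * i + k for k in range(n)] for i in range(n)]
B, source = permutation_for_flow(n, x0_a_cols={1}, y1_cols={0, 1})
C = matmul(A, B)
print(source)
```

With this B, every selected column of C is a verbatim copy of a column of A whose entries are all free to vary, which is the one-to-one correspondence the proof relies on.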

10.4.2 The Lower-Bound Method in the Basic Pebble Game 

The following theorem provides a lower bound on the exchange of space for time. Its proof 
uses a variant of the pigeonhole principle. Since the pebbling of vertices is assumed to occur 
sequentially, time is divided into intervals in which the number of output vertices pebbled, b, is 
chosen to be a small multiple of the number of pebbles, S, used in pebbling. The pigeonhole 
principle is used to show that a large number of inputs must be pebbled in each interval. 
In particular, we show that if the number of inputs pebbled inside an interval is small, the 
number of inputs outside the interval is large enough that there is a large flow from the inputs 
outside the interval to the outputs inside it. However, the flow cannot be any larger than can 
be supported by the number, S, of vertices carrying pebbles just before the interval. Thus, the 
number of input variables outside the interval is small, which implies that the number inside is 
large. That is, many inputs must be pebbled within each interval. Multiplying by the number 
of intervals in which b outputs are pebbled provides the lower bound. 



THEOREM 10.4.1 Let $f : A^n \mapsto A^m$ have a $w(u, v)$-flow and let it be realized by a
straight-line program over a basis $\{h : A^r \mapsto A^s \mid r, s \geq 1\}$. For arbitrary $b \leq m$, every
pebbling of every DAG for $f$ requires space $S$ and time $T$ satisfying the inequality

$$T \geq \lfloor m/b \rfloor (n - d)$$

where $d$ is the largest integer such that $w(d, b) \leq S$.

Proof Assume that $G = (V, E)$ is pebbled with $S \geq 1$ pebbles in $T \geq 1$ steps. Let
$T_I \leq T$ be the number of times that input vertices are pebbled. (This is generally more
than the number of input variables.)

Given a pebbling of $G$ with $S$ pebbles, group the consecutive pebbling steps into
intervals, the first $\lfloor m/b \rfloor$ of which contain $b$ pebbled outputs and one of which contains
$m - b\lfloor m/b \rfloor$ pebbled outputs.

Consider an arbitrary interval $I$ in which $b$ outputs are pebbled. Let $Y_1$ be these outputs
and let $x_0$ and $x_1$ be the number of inputs pebbled inside and outside the interval,
respectively. By definition, there is an assignment to the $x_0$ inputs such that the $b = |Y_1|$
outputs have at least $|A|^{w(x_1, b)}$ different values. If $w(x_1, b) > S$, the outputs $Y_1$ assume
more values than can be taken by the $S$ pebbles in use just prior to the start of $I$. Because
the values of variables in $Y_1$ are determined by the inputs pebbled in $I$, which are fixed, and
the values under the $S$ pebbles, this contradicts the definition of $f$. It follows that $x_1$ can be
no larger than $d$, where $d$ is the largest value such that $w(d, b) \leq S$. Thus the number of
inputs pebbled in $I$, $x_0$, satisfies $x_0 \geq (n - d)$.

Since there are $\lfloor m/b \rfloor$ intervals in which $b$ outputs are pebbled, the number of times
that inputs are pebbled, $T_I$, is at least $\lfloor m/b \rfloor (n - d)$. ■
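The bound of Theorem 10.4.1 is easy to evaluate once a flow function is available. The following sketch is illustrative only: the function names are hypothetical, and the flow function used in the example is a lower bound read off the proof of Lemma 10.4.1 (v selected outputs take |R|^v values when at most n inputs and outputs are constrained), which is assumed to be a valid, if weak, flow for matrix multiplication.

```python
def grigoriev_time_bound(n_inputs, m_outputs, b, S, w):
    """Evaluate T >= floor(m/b) * (n - d), where d is the largest integer in
    [0, n] with w(d, b) <= S. Returns (bound, d)."""
    feasible = [u for u in range(n_inputs + 1) if w(u, b) <= S]
    if not feasible:
        raise ValueError("no d with w(d, b) <= S")
    d = max(feasible)
    return (m_outputs // b) * (n_inputs - d), d

def matmul_flow(n):
    """Flow lower bound for n x n matrix multiplication taken from the proof of
    Lemma 10.4.1: w(u, v) >= v whenever (2n^2 - u) + v <= n, and 0 otherwise."""
    def w(u, v):
        return v if (2 * n * n - u) + v <= n else 0
    return w

n, S = 8, 3
bound, d = grigoriev_time_bound(2 * n * n, n * n, S + 1, S, matmul_flow(n))
print(bound, d)  # time lower bound and the value of d used
```

Choosing b = S + 1 here mirrors the choice made for independent functions: with one more output than pebbles, the flow exceeds what S pebbles can carry unless few inputs lie outside the interval.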

Grigoriev [121] established the above theorem for $(1, n, m, p)$-independent functions. We
restate as a corollary a slightly revised version of his theorem for $(a, n, m, p)$-independent
functions.

COROLLARY 10.4.1 Let $f : A^n \mapsto A^m$ be $(a, n, m, p)$-independent and let it be realized by a
straight-line program over a basis $\{h : A^r \mapsto A^s \mid r, s \geq 1\}$. Every pebbling of every DAG for $f$
requires space $S$ and time $T$ satisfying the inequality

$$\lceil a(S + 1) \rceil T \geq mp/4$$

Proof An $(a, n, m, p)$-independent function on $n$ inputs has a $w(u, v)$-flow satisfying
$w(u, v) > (v/a) - 1$ for $n - u + v \leq p$, where $x = n - u \geq 0$. Since $b$ can be
freely chosen, let $b = \lceil a(S + 1) \rceil$. Thus, $(b/a) - 1 \geq S$ for $(n - d) + b \leq p$, which
contradicts the requirement that $w(d, b) \leq S$. It follows that $(n - d) + b > p$ or that
$(n - d) > p - \lceil a(S + 1) \rceil$. With the inequality $\lfloor m/x \rfloor \geq (m - x + 1)/x$ (see
Problem 10.2), the following lower bound follows from Theorem 10.4.1:

$$T \geq \frac{(m - \lceil a(S+1) \rceil + 1)(p - \lceil a(S+1) \rceil)}{\lceil a(S+1) \rceil}$$

Since $p \leq m$, if $\lceil a(S + 1) \rceil \leq p/2$, the desired lower bound follows. On the other hand,
if $\lceil a(S + 1) \rceil > p/2$, $\lceil a(S + 1) \rceil T \geq mp/2$ since $T \geq m$. ■

It is possible that a function $f : A^n \mapsto A^m$ is not $(a, n, m, p)$-independent but a
subfunction $g : A^r \mapsto A^s$ is $(a, r, s, p)$-independent for $r \leq n$ and $s \leq m$. (Subfunctions are
defined in Section 2.4.) As shown in Problem 10.18, the lower bound for the subfunction $g$
applies to $f$.

Lower bounds on space-time exchanges can also be derived using properties of the graphs
to be pebbled. For example, if a graph contains a superconcentrator (defined in Section 10.8),
lower bounds on the product $(S + 1)T$ can be derived in terms of the number of inputs of
the graph. (See Problem 10.28.)

As mentioned at the beginning of this section, Theorem 10.4.1 is much more general
than it appears. In Problem 10.20 the reader is asked to show that the lower bound holds for
"input-output-oblivious" finite-state machines, FSMs that compute functions but read their 
inputs and produce their outputs at data-independent times. Problem 10.21 asks the reader to 
establish that pebblings of straight-line computations can be translated directly into computa- 
tions by finite-state machines. 

Figure 10.6 Pebbling an inner product graph with three pebbles.



10.4.3 First Matrix Multiplication Bound 

The Grigoriev lower-bound method is well illustrated by matrix multiplication. We established
its independence property in Section 10.4.1. In this section we combine that property with
Corollary 10.4.1. The upper bound stated in the following theorem follows from the development
of an algorithm for matrix multiplication that uses three pebbles and executes at most $4n^3$ steps.
This algorithm, based on the standard matrix multiplication algorithm of Section 6.2.2, forms each
of the $n^2$ inner products defined by the product of two $n \times n$ matrices using three pebbles, as
suggested in Fig. 10.6, and $4n - 1$ steps.
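The three-pebble strategy for one inner product can be simulated directly. The sketch below is an illustration rather than the book's construction: the vertex naming is hypothetical, and it assumes the variant of the pebble game in which a pebble may be moved from a predecessor onto a vertex once all predecessors of that vertex carry pebbles.

```python
def pebble_inner_product(n):
    """Pebble the inner-product DAG for two vectors of length n.
    Returns (steps, max_pebbles) for a strategy that pebbles each vertex once."""
    preds = {}
    for i in range(n):
        preds[("a", i)] = []
        preds[("b", i)] = []
        preds[("prod", i)] = [("a", i), ("b", i)]
        if i > 0:
            left = ("prod", 0) if i == 1 else ("sum", i - 1)
            preds[("sum", i)] = [left, ("prod", i)]

    pebbled = set()
    steps = 0
    max_pebbles = 0

    def place(v, slide_from=None):
        nonlocal steps, max_pebbles
        # A vertex may be pebbled only if all its predecessors carry pebbles.
        assert all(p in pebbled for p in preds[v]), f"illegal move at {v}"
        if slide_from is not None:
            assert slide_from in preds[v]
            pebbled.discard(slide_from)  # reuse the predecessor's pebble
        pebbled.add(v)
        steps += 1
        max_pebbles = max(max_pebbles, len(pebbled))

    for i in range(n):
        place(("a", i))
        place(("b", i))
        place(("prod", i), slide_from=("a", i))
        pebbled.discard(("b", i))
        if i > 0:
            left = ("prod", 0) if i == 1 else ("sum", i - 1)
            place(("sum", i), slide_from=("prod", i))
            pebbled.discard(left)
    return steps, max_pebbles

print(pebble_inner_product(5))
```

Each of the 2n inputs, n products, and n - 1 sums is pebbled exactly once, which accounts for the 4n - 1 steps quoted above.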

THEOREM 10.4.2 Every pebbling strategy for straight-line programs computing the matrix
multiplication function $f^{(n)}_{A \times B} : \mathcal{R}^{2n^2} \mapsto \mathcal{R}^{n^2}$ for $n \times n$ matrices requires space $S$ and time $T$
satisfying the following inequality:

$$(S + 1)T \geq n^3/4$$

The standard algorithm for multiplying $n \times n$ matrices uses space and time satisfying

$$(S + 1)T = 16n^3$$

Those familiar with fast non-standard matrix multiplication algorithms such as Strassen's 
fast matrix algorithm (Section 6.3) may find this result surprising. Whereas one learns that 
the standard matrix multiplication algorithm is not optimal with respect to computation time, 
the above result states that the standard matrix multiplication algorithm is nearly optimal with 
respect to the space-time product. 

In Section 10.5.4 we specialize Theorem 10.4.1 to the flow properties of matrix
multiplication, giving a stronger result: that the space and time for matrix multiplication must satisfy
the inequality $ST^2 = \Omega(n^6)$.



10.5 Applications of Grigoriev's Method 



Given the above results, to derive a lower bound on $\lceil a(S + 1) \rceil T$ using Corollary 10.4.1
it suffices to establish the independence property of a function. We apply this idea in this
section to convolution, cyclic shifting, integer multiplication, matrix-vector multiplication,
matrix inversion, and solving linear equations. We apply related arguments to derive lower
bounds for the discrete Fourier transform and merging. Finally, we apply Theorem 10.4.1 to
derive a lower bound on space-time exchanges for matrix-matrix multiplication that improves
upon the bound of Section 10.4.3. Where possible we also derive upper bounds on space-time
tradeoffs.

10.5.1 Convolution

The wrapped convolution on strings of length $n$ over the ring $\mathcal{R}$,
$f^{(n)}_{\mathrm{wrapped}} : \mathcal{R}^{2n} \mapsto \mathcal{R}^n$, is
defined in Problem 6.19. It can be characterized by the following product of a circulant matrix
with a vector (see Section 6.2):

$$
\begin{bmatrix} w_0 \\ w_1 \\ w_2 \\ \vdots \\ w_{n-1} \end{bmatrix}
=
\begin{bmatrix}
u_0 & u_{n-1} & u_{n-2} & \cdots & u_1 \\
u_1 & u_0 & u_{n-1} & \cdots & u_2 \\
u_2 & u_1 & u_0 & \cdots & u_3 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
u_{n-1} & u_{n-2} & u_{n-3} & \cdots & u_0
\end{bmatrix}
\begin{bmatrix} v_0 \\ v_1 \\ v_2 \\ \vdots \\ v_{n-1} \end{bmatrix}
\qquad (10.1)
$$


Lemma 10.5.1 demonstrates $(2, 2n, n, n/2)$-independence for the wrapped convolution
function $f^{(n)}_{\mathrm{wrapped}} : \mathcal{R}^{2n} \mapsto \mathcal{R}^n$ by showing that for any set $X_0$ of inputs there is a way to put
$|Y_1|/2$ of the inputs in $X - X_0$ into a one-to-one correspondence with $|Y_1|/2$ entries in any
set $Y_1$ of outputs. This is established by setting one component of $v$ to 1 and the rest to 0.



LEMMA 10.5.1 For $n$ even, the wrapped convolution $f^{(n)}_{\mathrm{wrapped}} : \mathcal{R}^{2n} \mapsto \mathcal{R}^n$ over the ring $\mathcal{R}$ is
$(2, 2n, n, n/2)$-independent.

Proof Consider subsets $X_0$ and $Y_1$ of the inputs $X$ and outputs $Y$ of $f^{(n)}_{\mathrm{wrapped}}$ satisfying
$|X_0| + |Y_1| = p = n/2$. For $f^{(n)}_{\mathrm{wrapped}}$ to be $(2, 2n, n, n/2)$-independent, there must be
an assignment to input variables in $X_0$ such that the output variables in $Y_1$ have more than
$|\mathcal{R}|^{(|Y_1|/2) - 1}$ distinct values as the input variables of $f^{(n)}_{\mathrm{wrapped}}$ in $X_1 = X - X_0$ range over
all possible values.

As shown above, $f^{(n)}_{\mathrm{wrapped}}$ is defined by a matrix-vector product $w = Mv$, $M$ a
circulant matrix, in which each row (column) is a cyclic shift of the first row (column). Let
$e = |X_0 \cap \{u_0, u_1, \ldots, u_{n-1}\}|$. Thus, every row of $M$ contains the same number $e$ of
entries from $X_0$. Also, $n - e$ of the inputs $u_i$ are in $X_1 = X - X_0$. The entries in $X_1$ are free to vary.

Each output in $Y_1$ corresponds to a row of $M$. The number of instances of input
variables from $X_1$ in these rows is $|Y_1|(n - e)$. Since these rows have $n$ columns, there
is some column, say the $t$th, containing at least the average number of instances from $X_1$.
This average is $|Y_1|(1 - e/n) \geq |Y_1|/2$ because $e \leq |X_0| \leq n/2$. (The instances of variables
from $X_1$ in a column are distinct.) It follows that by choosing the $t$th component of $v$, $v_t$, to be 1
and the others to be 0, at least $|Y_1|/2$ of the inputs in $X_1$ are mapped onto outputs in $Y_1$. Since
these inputs (and outputs) can assume $|\mathcal{R}|^{|Y_1|/2}$ different values, it follows that $f^{(n)}_{\mathrm{wrapped}}$ is
$(2, 2n, n, n/2)$-independent. ■
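The averaging step in this proof amounts to a search over the n columns of the circulant matrix. The sketch below is a hedged illustration: the function names are hypothetical, and it assumes the wrapped convolution is $w_j = \sum_s u_{(j-s) \bmod n} v_s$, which is the reading of the circulant product (10.1) used here.

```python
def wrapped_convolution(u, v):
    """w[j] = sum over s of u[(j - s) mod n] * v[s] (assumed definition)."""
    n = len(u)
    return [sum(u[(j - s) % n] * v[s] for s in range(n)) for j in range(n)]

def best_column(n, x0_u, y1_rows):
    """Return (t, matched): the column t of the circulant matrix that maximizes
    the number of rows j in y1_rows whose entry u[(j - t) mod n] is a free input
    (its index is not in x0_u). matched maps each such row j to that index."""
    best_t, best = 0, {}
    for t in range(n):
        matched = {j: (j - t) % n for j in sorted(y1_rows)
                   if (j - t) % n not in x0_u}
        if len(matched) > len(best):
            best_t, best = t, matched
    return best_t, best

n = 8
t, matched = best_column(n, x0_u={0, 1}, y1_rows={0, 3})
u = [10 + i for i in range(n)]
v = [1 if s == t else 0 for s in range(n)]  # one component of v is 1, the rest 0
w = wrapped_convolution(u, v)
print(t, matched, [w[j] for j in sorted(matched)])
```

Setting v to the unit vector selected by `best_column` makes each matched output a copy of a distinct free input, as the lemma requires.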

This implies the lower bound stated below. The upper bound follows from the standard 
matrix-vector algorithm for the wrapped convolution using the observation that an inner prod- 
uct can be done with three pebbles, as suggested in Fig. 10.6. 

THEOREM 10.5.1 The time $T$ and space $S$ required to pebble any straight-line program for the
standard or wrapped convolution must satisfy the following inequality:

$$(S + 1)T \geq n^2/16$$

This lower bound can be achieved to within a constant multiplicative factor for $S = O(1)$.

10.5.2 Cyclic Shifting

The cyclic shifting function $f^{(n)}_{\mathrm{cyclic}} : \mathcal{B}^{n + \lceil \log n \rceil} \mapsto \mathcal{B}^n$ defined in Section 2.5.2 is a
subfunction of many functions, including integer multiplication and squaring (see Section 2.9.5),
integer reciprocal (see Section 2.10.1), and powers of integers (see Problems 2.34 and 2.35).

Cyclic shifting is another good example of a problem for which a lower bound on the 
exchange of space and time exists. The method used to establish the independence properties 
of this function can be generalized to the class of transitive functions. (See Problem 10.22.) 

We redefine $f^{(n)}_{\mathrm{cyclic}}$ here. Let $k = \lceil \log n \rceil$. The input variables of $f^{(n)}_{\mathrm{cyclic}}$ are segmented
into two groups, an $n$-tuple $x = (x_{n-1}, \ldots, x_1, x_0)$ of value variables and a $k$-tuple $s =
(s_{k-1}, \ldots, s_1, s_0)$ of control variables. The control variables specify the integer $|s|$:

$$|s| = s_{k-1} 2^{k-1} + \cdots + s_1 2^1 + s_0$$

$|s|$ is the number of places by which the value inputs must be shifted left cyclically to produce
the output $n$-tuple $y = (y_{n-1}, \ldots, y_1, y_0)$. That is, $f^{(n)}_{\mathrm{cyclic}}(x, s) = (y)$, where

$$y_j = x_{(j - |s|) \bmod n} \quad \text{for } 0 \leq j \leq n - 1 \qquad (10.2)$$

A circuit to implement $f^{(n)}_{\mathrm{cyclic}}$ is given in Section 2.5.2 that cyclically shifts $x$ left by $2^j$ places
for each of those values of $j$, $0 \leq j \leq \lceil \log n \rceil - 1$, such that $s_j = 1$.
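Equation (10.2) and the staged circuit can be compared in a short sketch. The code is illustrative only: the function names are hypothetical, and lists are indexed so that `x[j]` holds $x_j$ and `s_bits[i]` holds $s_i$ (least significant control bit first).

```python
def cyclic_shift(x, s_bits):
    """Direct form of (10.2): y[j] = x[(j - |s|) mod n]."""
    n = len(x)
    shift = sum(bit << i for i, bit in enumerate(s_bits))
    return [x[(j - shift) % n] for j in range(n)]

def cyclic_shift_staged(x, s_bits):
    """Staged form: shift by 2^i places for every control bit s_i equal to 1."""
    n = len(x)
    y = list(x)
    for i, bit in enumerate(s_bits):
        if bit:
            step = 1 << i
            y = [y[(j - step) % n] for j in range(n)]
    return y

x = list("abcdef")
print(cyclic_shift(x, [1, 1, 0]))         # |s| = 3
print(cyclic_shift_staged(x, [1, 1, 0]))  # shift by 1, then by 2
```

Both functions move $x_0$ to position $|s|$, which is a shift to the left when the tuple is written with $x_{n-1}$ first, as in the text.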

The independence properties of the cyclic function are shown by demonstrating that some 
permutation of the input vector x aligns unselected inputs with selected outputs. 

LEMMA 10.5.2 $f^{(n)}_{\mathrm{cyclic}} : \mathcal{B}^{n + \lceil \log n \rceil} \mapsto \mathcal{B}^n$ is $(2, n + \lceil \log n \rceil, n, n/2)$-independent.

Proof Consider subsets $X_0$ and $Y_1$ of the inputs $X$ and outputs $Y$ of $f^{(n)}_{\mathrm{cyclic}}$ satisfying
$|X_0| + |Y_1| = p = n/2$. For $f^{(n)}_{\mathrm{cyclic}}$ to be $(2, n + \lceil \log n \rceil, n, n/2)$-independent, there must
be an assignment to input variables in $X_0$ such that the output variables in $Y_1$ have more
than $|\mathcal{B}|^{(|Y_1|/2) - 1}$ distinct values as the input variables of $f^{(n)}_{\mathrm{cyclic}}$ in $X_1 = X - X_0$ range
over all possible values.

Let $X_0$ contain $e$ elements from $x$. Let $y_i \in Y_1$. As $s$ runs through all possible shift
values, $y_i$ is made equal to every one of the inputs in $x$. For $n - e$ of these shifts $y_i$ is
set equal to an input in $X_1 = X - X_0$. (For example, if $n = 6$ and $e = 2$, so that $X_1$
contains four of the value inputs, and $Y_1 = \{y_2, y_3, y_5\}$, then as $s$ ranges over all of its values, each
of the three $y_i$ in $Y_1$ is assigned four different variables in $X_1$.) Thus, the number of input
variables assigned to outputs, summed over all cyclic shifts, is $|Y_1|(n - e)$. Since there are
$n$ cyclic shifts, for some shift the number of variables in $X_1$ that are matched with outputs
in $Y_1$ is at least the average of this quantity; that is, at least $|Y_1|(1 - e/n) \geq |Y_1|/2$. Thus,
some shift sets at least $|Y_1|/2$ inputs in $X_1$ to outputs in $Y_1$. Since these outputs can assume
$|\mathcal{B}|^{|Y_1|/2}$ different values, it follows that $f^{(n)}_{\mathrm{cyclic}}$ is $(2, n + \lceil \log n \rceil, n, n/2)$-independent. ■

THEOREM 10.5.2 Every pebbling strategy for straight-line programs computing the cyclic shifting
function $f^{(n)}_{\mathrm{cyclic}} : \mathcal{B}^{n + \lceil \log n \rceil} \mapsto \mathcal{B}^n$ requires space $S$ and time $T$ satisfying the inequality

$$(S + 1)T \geq n^2/16$$

An algorithm exists to compute $f^{(n)}_{\mathrm{cyclic}}$ that uses space $O(n)$ and time $O(n \log n)$, namely, that
satisfies the inequality

$$(S + 1)T = O(n^2 \log n)$$

Proof We leave the upper-bound proof to the reader. (See Problem 10.30.) ■

We now apply this result to integer multiplication. 

10.5.3 Integer Multiplication 

To apply Grigoriev's method to the binary integer multiplication function $f^{(n)}_{\mathrm{mult}} : \mathcal{B}^{2n} \mapsto \mathcal{B}^{2n}$
of Section 2.9, we assemble a collection of results to show that with the proper encoding of one
of its two arguments, $f^{(n)}_{\mathrm{mult}}$ computes the logical shifting function $f_{\mathrm{shift}}$ (see Lemma 2.9.1)
and when $n$ is even the logical shifting function $f_{\mathrm{shift}}$ contains the cyclic shift function $f_{\mathrm{cyclic}}$
as a subfunction (see Lemma 2.5.2). Thus, $f^{(n)}_{\mathrm{mult}}$ contains $f_{\mathrm{cyclic}}$ as a subfunction. We use
this fact to obtain a lower bound on the space-time product for integer multiplication.



THEOREM 10.5.3 Let $n$ be even. Every pebbling strategy for straight-line programs computing the
binary integer multiplication function $f^{(n)}_{\mathrm{mult}} : \mathcal{B}^{2n} \mapsto \mathcal{B}^{2n}$ requires space $S$ and time $T$ satisfying
the following inequality:

$$(S + 1)T \geq n^2/64$$

An algorithm exists for multiplying $n$-bit integers using space $O(\log n)$ and time $O(n^2)$, namely,
that satisfies

$$(S + 1)T = O(n^2 \log n)$$

Proof The lower-bound argument is given above. The upper bound follows from a pebbling
of an integer multiplication circuit to multiply $n$-bit binary integers $u$ and $v$. The circuit is
based on the following standard expansion of their product:

$$
\begin{array}{ccccccc}
 & & & v_3 u_0 & v_2 u_0 & v_1 u_0 & v_0 u_0 \\
 & & v_3 u_1 & v_2 u_1 & v_1 u_1 & v_0 u_1 & \\
 & v_3 u_2 & v_2 u_2 & v_1 u_2 & v_0 u_2 & & \\
v_3 u_3 & v_2 u_3 & v_1 u_3 & v_0 u_3 & & &
\end{array}
$$

To construct a circuit we use the observation that the number of 1s in the $j$th column is the
$j$th component, $w_j$, of the convolution $w = u \otimes v$. (See Section 6.7.4.)

To compute $w_j$ we use the counting circuit $f^{(n)}_{\mathrm{count}} : \mathcal{B}^n \mapsto \mathcal{B}^{\lceil \log n \rceil}$ of Section 2.11 on
$n$ inputs to count the number of 1s among the products $u_r v_s$ of the Boolean variables $u_r$
and $v_s$ in the sum

$$w_j = \sum_{r+s=j} u_r * v_s \quad \text{for } 0 \leq j \leq 2n - 2$$

To compute the $2n$-bit product we add the binary representations for $w_0, w_1, \ldots, w_{2n-2}$
in a set of $(2n - 1)$ ripple adders, adding $w_j$ to the sum $\sigma(j) = \sum_{0 \leq i \leq j-1} w_i 2^i$, as
suggested in Fig. 10.7, where we omit the counting circuits used to compute the values of
$w_0, \ldots, w_{2n-2}$.

Figure 10.7 A multiplication circuit that can be pebbled in $O(n^2)$ time and $O(\log n)$ space.
The counting circuits that generate $w_0, w_1, \ldots, w_{2n-2}$ are not shown.



Each counting function can be pebbled with $O(n)$ steps using $O(\log n)$ pebbles
without repebbling vertices. (See Problem 10.10.) After a counting circuit is pebbled, pebbles
remain on its outputs until their values have been used elsewhere in the multiplication
circuit.

The value of $w_j$ is represented by a $k$-tuple, $k \leq \lceil \log_2 n \rceil$. The value of $\sigma(j)$ is
represented by at most $\lceil \log_2 (n(2^j - 1)) \rceil \leq j + \lceil \log_2 n \rceil$ bits since it is the sum of at most $n$
$j$-bit binary numbers. Because $w_j$ is added after the first $j$ bits, the pebbles on these bits can
be discarded. Only $\lceil \log_2 n \rceil$ bits of the running sum and a like number for $w_j$ are needed to
hold values on the inputs to the ripple adder. A fixed additional number of pebbles suffices
to pebble the internal vertices of the adder. On completion of the sum only $\lceil \log_2 n \rceil$ pebbles
are needed. They are used to hold the portion of the running sum that is used in the next
stage of addition.

For each value of $j$, $0 \leq j \leq 2(n - 1)$, $O(\log n)$ steps are executed in the ripple adder
and $O(n)$ steps are executed in a counting circuit. Consequently, $O(\log n)$ pebbles and
$O(n^2)$ time suffice to compute the product of $n$-bit binary numbers. ■
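The structure of this circuit, column counts followed by a running sum, can be mirrored in software. The sketch below is only an illustration of the arithmetic, not of the pebbling: the function name is hypothetical and bit lists are assumed to be least significant bit first.

```python
def multiply_by_column_counts(u_bits, v_bits):
    """Multiply two n-bit integers given as bit lists (least significant first).
    Returns (w, product), where w[j] counts the 1s among u_r * v_s with r + s = j."""
    n = len(u_bits)
    w = [0] * (2 * n - 1)
    for r in range(n):
        for s in range(n):
            w[r + s] += u_bits[r] & v_bits[s]
    # Running sum: after processing column j, total = sum of w_i * 2^i for i <= j.
    total = 0
    for j, count in enumerate(w):
        total += count << j
    return w, total

# 13 = 1101 and 11 = 1011 in binary, written least significant bit first.
w, product = multiply_by_column_counts([1, 0, 1, 1], [1, 1, 0, 1])
print(w, product)
```

Each `w[j]` is at most n, so it fits in about log n bits, which is why only a logarithmic window of the running sum needs to be held at any time in the circuit.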

In Section 10.13.2 we show that a lower bound of $\Omega(n^2/\log^2 n)$ applies under the
branching program model. The stronger lower bound of $\Omega(n^2)$ derived here reflects the extra
constraints imposed on the pebble game, namely that inputs are read and computations performed
at data-independent times.

Similar results apply to the squaring function $f^{(n)}_{\mathrm{square}}$ since, as shown in Lemma 2.9.2,
$f^{(3n+1)}_{\mathrm{square}}$ contains $f^{(n)}_{\mathrm{mult}}$ as a subfunction. (See Problem 10.32.)

Similar results also apply to the reciprocal function $f^{(n)}_{\mathrm{recip}} : \mathcal{B}^n \mapsto \mathcal{B}^n$ since, as shown
in Lemma 2.10.1, $f^{(n)}_{\mathrm{recip}}$ contains as a subfunction the squaring function $f^{(m)}_{\mathrm{square}}$ for $m =
\lfloor n/12 \rfloor - 1$. (See Problem 10.33.)



10.5.4 Matrix Multiplication 

In this section we show that the matrix multiplication function is richer than the other
functions examined above in that it exhibits a stronger space-time lower bound than given in
Theorem 10.4.2. After we derive a lower bound on the function $w(u, v)$ we specialize
Theorem 10.4.1 to this case, thereby deriving the stronger lower bound.

LEMMA 10.5.3 The matrix multiplication function $f^{(n)}_{A \times B} : \mathcal{R}^{2n^2} \mapsto \mathcal{R}^{n^2}$ over the ring $\mathcal{R}$ has
a $w(u, v)$-flow, where $w(u, v)$ satisfies the following lower bound:

$$w(u, v) \geq (v - (2n^2 - u)^2/4n^2)/2$$

Proof Let $C = AB$ be the product of $n \times n$ matrices $A$ and $B$. We establish this result by
using characteristic functions to identify the outputs in $C$ in $Y_1$ and the inputs in $A$ and $B$
in $X_1$, as indicated below. Here the indices $i$ and $j$ range over $0 \leq i, j \leq n - 1$:

$$
\gamma_{i,j} = \begin{cases} 1 & c_{i,j} \in Y_1 \\ 0 & \text{otherwise} \end{cases}
\qquad
\alpha_{i,j} = \begin{cases} 1 & a_{i,j} \in X_1 \\ 0 & \text{otherwise} \end{cases}
\qquad
\beta_{i,j} = \begin{cases} 1 & b_{i,j} \in X_1 \\ 0 & \text{otherwise} \end{cases}
$$

Let $\mathcal{A}$, $\mathcal{B}$, and $\mathcal{C}$ denote the matrices $[\alpha_{i,j}]$, $[\beta_{i,j}]$, and $[\gamma_{i,j}]$, respectively. Denote by $|\mathcal{A}|$,
$|\mathcal{B}|$, and $|\mathcal{C}|$ the number of 1s in the three corresponding matrices. Note that $|\mathcal{A}| + |\mathcal{B}| =
|X_1|$ and $|\mathcal{C}| = |Y_1|$.

The $k$th $n \times n$ cyclic permutation matrix $P(k)$ is the $n \times n$ identity matrix in which
the rows are rotated cyclically $k - 1$ times. For example, the following $3 \times 3$ matrix is $P(3)$.

$$
\begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{bmatrix}
$$

Let $D$ be an $n \times n$ matrix. The matrix $P(k)D$ consists of the rows of $D$ shifted cyclically
down $k - 1$ places. Similarly, the matrix $DP(k)$ consists of the columns of $D$ shifted
cyclically left $k - 1$ places.

Let $\mathcal{B}(k)$ be the matrix $\mathcal{B}$ obtained by multiplication on the left by $A = P(k)$.
Similarly, let $\mathcal{A}(k)$ be the matrix $\mathcal{A}$ obtained by multiplication on the right by $B = P(k)$.
Then, a 1 value for the $(i, j)$ entry in $\mathcal{A}(k)$ and $\mathcal{B}(k)$ identifies a variable in $X_1$ that is
mapped to an output variable of $C$ through its multiplication by $P(k)$.

Let $D$ and $E$ be $n \times n$ matrices whose entries are drawn from the set $\{0, 1\}$. We denote
by $D \cap E$ the $n \times n$ matrix whose $(i, j)$ entry is 1 if $d_{i,j} = e_{i,j} = 1$. Similarly, let $D \cup E$
be the $n \times n$ matrix whose $(i, j)$ entry is 1 if either $d_{i,j} = 1$ or $e_{i,j} = 1$. The following
identity applies:

$$|D \cup E| + |D \cap E| = |D| + |E| \qquad (10.3)$$

Since $|D \cup E| \leq n^2$ for $n \times n$ matrices, the following inequality holds:

$$|D \cap E| \geq |D| + |E| - n^2 \qquad (10.4)$$

Also, since $|D \cap E| \geq 0$ we have

$$|D| + |E| \geq |D \cup E| \qquad (10.5)$$

The $w(u, v)$-flow of matrix multiplication is large if for some choice of $r$ or $s$, $|\mathcal{C} \cap \mathcal{A}(r)|$
or $|\mathcal{C} \cap \mathcal{B}(s)|$ is large. This follows because choosing $A$ to be the $r$th cyclic permutation
makes many variables of $B$ in $X_1$ match entries in $C$ in $Y_1$, or choosing $B$ to be the $s$th
cyclic permutation makes many variables of $A$ in $X_1$ match entries in $C$ in $Y_1$. When an
input and output variable match, the latter assumes the value of the former. Thus, all the
variation in the former is reflected in the latter.

Let $Q = |\mathcal{C} \cap \mathcal{A}(r)| + |\mathcal{C} \cap \mathcal{B}(s)|$. Then the $w(u, v)$-flow is at least $Q/2$. Applying
(10.5) and then (10.4) to $Q$, we have the following inequalities:

$$Q \geq |\mathcal{C} \cap (\mathcal{A}(r) \cup \mathcal{B}(s))| \geq |\mathcal{C}| + |\mathcal{A}(r) \cup \mathcal{B}(s)| - n^2$$

Applying (10.3) to $|\mathcal{A}(r) \cup \mathcal{B}(s)|$ yields the following lower bound on $Q$:

$$Q \geq |\mathcal{C}| + |\mathcal{A}(r)| + |\mathcal{B}(s)| - |\mathcal{A}(r) \cap \mathcal{B}(s)| - n^2 \qquad (10.6)$$

But $|\mathcal{C}| = |Y_1|$, $|\mathcal{A}(r)| = |\mathcal{A}|$, $|\mathcal{B}(s)| = |\mathcal{B}|$, and $|\mathcal{A}| + |\mathcal{B}| = |X_1|$. We now show that
there are values for $r$ and $s$ such that $|\mathcal{A}(r) \cap \mathcal{B}(s)|$ is at most $|\mathcal{A}||\mathcal{B}|/n^2$.
Consider the following sum:

$$S = \sum_{r=1}^{n} \sum_{s=1}^{n} |\mathcal{A}(r) \cap \mathcal{B}(s)|$$

Since $\mathcal{A}(r)$ and $\mathcal{B}(s)$ are formed by the $r$th and $s$th cyclic shift of columns of $\mathcal{A}$ and rows
of $\mathcal{B}$ respectively, each 1 in $\mathcal{A}$ is aligned once with each 1 in $\mathcal{B}$. It follows that

$$S = |\mathcal{A}||\mathcal{B}|$$

As a consequence, there are some $r$ and $s$ such that $|\mathcal{A}(r) \cap \mathcal{B}(s)|$ is at most $S/n^2$. Applying
this result in (10.6), we have the following lower bound on $Q$:

$$Q \geq |Y_1| + |\mathcal{A}| + |\mathcal{B}| - |\mathcal{A}||\mathcal{B}|/n^2 - n^2$$

Since $|X_1| = |\mathcal{A}| + |\mathcal{B}|$ is fixed, the above lower bound on $Q$ is minimized by maximizing
$|\mathcal{A}||\mathcal{B}|$ under variation of $|\mathcal{A}|$. This maximum occurs when $|\mathcal{A}| = |X_1|/2$. Consequently
we have the following lower bound on $Q$:

$$Q \geq |Y_1| - n^2 \left(1 - \frac{|X_1|}{2n^2}\right)^2$$

Since $w(u, v) \geq Q/2$ for $u = |X_1|$ and $v = |Y_1|$, we have the desired conclusion. ■
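The counting step in this proof, that the overlaps between all cyclic shifts sum to the product of the numbers of 1s, can be checked on a small example. The sketch below is illustrative: the function names and the two 0/1 matrices are hypothetical, and the shifts follow the conventions stated for $DP(k)$ and $P(k)D$.

```python
def shift_columns_left(M, r):
    """Columns of M shifted cyclically left r - 1 places (the matrix M P(r))."""
    n = len(M)
    return [[M[i][(j + r - 1) % n] for j in range(n)] for i in range(n)]

def shift_rows_down(M, s):
    """Rows of M shifted cyclically down s - 1 places (the matrix P(s) M)."""
    n = len(M)
    return [M[(i - (s - 1)) % n][:] for i in range(n)]

def overlap(D, E):
    """Number of positions in which both 0/1 matrices hold a 1."""
    return sum(d & e for row_d, row_e in zip(D, E) for d, e in zip(row_d, row_e))

def overlaps(A, B):
    """Overlap of the r-th column shift of A with the s-th row shift of B."""
    n = len(A)
    return {(r, s): overlap(shift_columns_left(A, r), shift_rows_down(B, s))
            for r in range(1, n + 1) for s in range(1, n + 1)}

A = [[1, 0, 0], [0, 1, 1], [0, 0, 0]]  # characteristic matrix with 3 ones
B = [[0, 1, 0], [1, 1, 0], [0, 0, 1]]  # characteristic matrix with 4 ones
table = overlaps(A, B)
print(sum(table.values()), min(table.values()))
```

Because the n^2 overlaps sum to the product of the two counts, at least one pair (r, s) has overlap no larger than the average, which is the bound used in (10.6).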

We now apply this result and Theorem 10.4.1 to derive a stronger result for matrix
multiplication than was obtained earlier using its $(1, 2n^2, n^2, n)$-independence property.



THEOREM 10.5.4 Every pebbling strategy for straight-line programs computing the matrix
multiplication function $f^{(n)}_{A \times B} : \mathcal{R}^{2n^2} \mapsto \mathcal{R}^{n^2}$ for $n \times n$ matrices requires space $S$ and time $T$ satisfying
the following inequality:

$$ST^2 \geq n^6/3$$

The standard algorithm for multiplying $n \times n$ matrices uses space and time satisfying

$$ST^2 = 48n^6$$

Proof From Lemma 10.5.3 we have that the matrix multiplication function has a $w(u, v)$-flow,
where

$$w(u, v) \geq (v - (2n^2 - u)^2/4n^2)/2$$

Applying Theorem 10.4.1 to this problem with $b = 3S$, we seek the largest integer $d$ such
that $w(d, b) \leq S$, which must satisfy the bound

$$(3S - (2n^2 - d)^2/4n^2)/2 \leq S$$

This implies that $(2n^2 - d) \geq 2n\sqrt{S}$. From Theorem 10.4.1, the time to pebble the graph
satisfies

$$T \geq 2\sqrt{S}\,n \lfloor n^2/3S \rfloor \geq 2\sqrt{S}\,n (n^2 - 3S + 1)/3S$$

If $S \leq n^2/27$, $T \geq (16 n^3)/(27\sqrt{S})$ or $ST^2 \geq (.35)n^6$. On the other hand, since
$T \geq 3n^2$ just to pebble inputs and outputs, if $S > n^2/27$, then $ST^2 \geq n^6/3$. ■

10.5.5 Discrete Fourier Transform 

The discrete Fourier transform (DFT) is defined in Section 6.7.3. We derive upper and lower 
bounds on the space-time product needed to compute this function. 

LEMMA 10.5.4 The $n$-point DFT function $F_n : \mathcal{R}^n \mapsto \mathcal{R}^n$ over a commutative ring $\mathcal{R}$ is
$(2, n, n, n/2)$-independent for $n$ even.

Proof As shown in equation (6.23), the DFT is defined by the matrix-vector product
$[\omega^{ij}]a$, where $[\omega^{ij}]$ is a Vandermonde matrix. To show that the DFT function is $(2, n, n,
n/2)$-independent, consider any set $Y_1$ of outputs (corresponding to rows of $[\omega^{ij}]$) and any
set $X_0$ of inputs (corresponding to columns) whose values are to be fixed judiciously, where
$p = |X_0| + |Y_1| = n/2$. We show that the outputs in $Y_1$ have at least $|\mathcal{R}|^{|Y_1|/2}$ values as we
vary over the remaining inputs.

It is straightforward to show that the submatrix of $[\omega^{ij}]$ defined by any $|Y_1|$ rows and any
$|Y_1|$ consecutive columns is non-singular. (Its determinant is that of another Vandermonde
matrix. Show this by letting the row and column indices be $r_1, r_2, \ldots, r_{|Y_1|}$ and $s, s +
1, \ldots, s + |Y_1| - 1$, respectively, and demonstrating that $\omega^{r_i s}$ can be factored out of the $i$th
row when computing its determinant.) Our goal is to show that some consecutive group of
columns corresponds to at least $|Y_1|/2$ inputs of $a$ in $X_1$.

Divide the $n$ columns of $[\omega^{ij}]$ into $\lceil n/|Y_1| \rceil$ groups of consecutive columns with $|Y_1|$
inputs in each group except possibly the last, which may have fewer. There are $n - |X_0|$
inputs that may vary. Since there are $\lceil n/|Y_1| \rceil$ groups, by an averaging argument some group
contains at least $(n - |X_0|)/\lceil n/|Y_1| \rceil$ of these inputs. Since $\lceil n/|Y_1| \rceil \leq (n + |Y_1| - 1)/|Y_1|$,
we show that $(n - |X_0|)/\lceil n/|Y_1| \rceil \geq |Y_1|/2$ for $p = n/2$. Observe that $(n - |X_0|)/(n +
|Y_1| - 1) \geq 1/2$ if $2n - 2|X_0| \geq n + |Y_1| - 1$ or $n \geq |X_0| + p - 1$, which holds because
$|X_0| \leq p \leq n/2$.

Since the submatrix defined by $k$ consecutive columns and any $k$ rows where $\lceil |Y_1|/2 \rceil \leq
k \leq |Y_1|$ is non-singular, it follows that any subset of $\lceil |Y_1|/2 \rceil$ columns has full rank. Thus,
the submatrix contains a non-singular $\lceil |Y_1|/2 \rceil \times \lceil |Y_1|/2 \rceil$ matrix. When all inputs outside
of these columns are set to zero, the $\lceil |Y_1|/2 \rceil$ outputs have $|\mathcal{R}|^{\lceil |Y_1|/2 \rceil}$ values, or $F_n$ is
$(2, n, n, n/2)$-independent. ■

The space-time lower bound stated below follows from Corollary 10.4.1. 

THEOREM 10.5.5 To pebble any straight-line program for the $n$-point DFT over a commutative
ring $\mathcal{R}$ requires space $S$ and time $T$ satisfying the following:

$$(S + 1)T \geq n^2/16$$

when $n$ is even. The FFT graph on $n = 2^d$ inputs can be pebbled with space $S$ and time $T$
satisfying the upper bound

$$T \leq 4n^2/(S - \log_2 n) + n \log_2 S$$

Thus, $(S + 1)T = O(n^2)$ when $2 \log_2 n \leq S \leq (n/\log_2 n) + \log_2 n$.

Proof This lower bound can be achieved up to a constant factor by a pebbling strategy
for the FFT algorithm, as we now show. Denote with $F^{(d)}$ the $n$-point FFT graph (it has
$n$ inputs), $n = 2^d$. (Figures 6.1, 6.7, and 10.8 show 4-point, 16-point, and 32-point
FFT graphs.) Inputs are at level 0 and outputs are at level $d$. We invoke Lemma 6.7.4
to decompose $F^{(d)}$ at level $d - e$ into a set of $2^{d-e}$ top $2^e$-point FFT graphs above the
split, $\{F^{(e)}_{t,j} \mid 1 \le j \le 2^{d-e}\}$, and a set of $2^e$ $2^{d-e}$-point FFT graphs below the split,
$\{F^{(d-e)}_{b,j} \mid 1 \le j \le 2^e\}$, as suggested in Fig. 10.8. In this figure the vertices and edges have
been grouped together as recognizable FFT graphs and surrounded by shaded boxes. The
edges between boxes identify vertices that are common to pairs of FFT subgraphs.



Figure 10.8 Decomposition of the FFT graph $F^{(5)}$ into four copies of $F^{(3)}$ and eight copies
of $F^{(2)}$. Edges between bottom and top sub-FFT graphs are fictitious; they identify overlapping
vertices between sub-FFT graphs.

A good strategy for pebbling the vertices of an FFT graph is to pebble the top FFT
graphs $\{F^{(e)}_{t,j} \mid 1 \le j \le 2^{d-e}\}$ individually. The vertices of a top FFT graph in Fig. 10.8
are highlighted. To pebble its inputs, which are output vertices of FFT graphs below the
split, it suffices to pebble the subtrees rooted at these vertices. (They are also highlighted.)
Such subtrees are completely balanced binary trees with $2^{d-e}$ inputs. Thus, $d - e + 1$ pebbles
and $2^{d-e+1} - 1$ pebble placements suffice to place a pebble on the root of one such subtree.
If these subtrees are pebbled in sequence, pebbles can be left on the inputs to a $2^e$-point FFT
graph $F^{(e)}_{t,j}$ above the split using at most $2^e + d - e$ pebbles and $2^e(2^{d-e+1} - 1)$ pebble
placements. Since $2^e + 1$ pebbles and $e2^e$ pebble placements suffice to pebble $F^{(e)}_{t,j}$ level by
level without repebbling vertices, it follows that all instances of $F^{(e)}$ above the split can be
pebbled using a total of $T = 2^d(2^{d-e+1} + e - 1)$ pebble placements and $S = 2^e + d - e$
pebbles.

We now derive an upper bound on $T$ by deriving upper and lower bounds on the value
of $e$ satisfying $S = 2^e + d - e$. Because $S \ge 2^e$, we have $e \le \log_2 S$. Let $e_0$ be the smallest
integer such that $2^{e_0+1} + d \ge S$. Then, $2^{e_0} + d - e_0 \le S$ and $e \ge e_0$. Consequently,
$2^e \ge (S - d)/2$, from which we have

$T = 2^d(2^{d-e+1} + e - 1) \le 4 \cdot \frac{2^{2d}}{S - d} + 2^d \log_2 S$

Finally, $\log_2 S \le 2^d/(S - d) \le 2 \cdot 2^d/S$ when $2d \le S \le (2^d/d) + d$, from which the
desired conclusion follows. ■
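The space and time of this pebbling strategy are easy to tabulate. The following sketch evaluates $S = 2^e + d - e$ and $T = 2^d(2^{d-e+1} + e - 1)$ and compares $T$ with the upper bound of Theorem 10.5.5; the function names are assumptions made for illustration.

```python
from math import log2

def fft_pebbling_cost(d, e):
    """Return (S, T) for the strategy that splits the 2^d-point FFT graph at level d - e."""
    if not 1 <= e <= d:
        raise ValueError("need 1 <= e <= d")
    space = 2**e + d - e                      # pebbles used
    time = 2**d * (2**(d - e + 1) + e - 1)    # pebble placements
    return space, time

def upper_bound(d, space):
    """Right-hand side 4 n^2 / (S - log2 n) + n log2 S with n = 2^d."""
    n = 2**d
    return 4 * n * n / (space - d) + n * log2(space)
```

For the decomposition of Fig. 10.8 ($d = 5$, $e = 2$), `fft_pebbling_cost(5, 2)` returns `(7, 544)`.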



10.5.6 Merging Networks



In this section we consider networks of comparators to merge two sorted lists. Such networks 
were described in Section 6.8 and an example was given, Batcher's (m,p) bitonic merging 
network. 

A comparator element computes the function $\otimes : \mathcal{A}^2 \mapsto \mathcal{A}^2$ that returns the maximum
and minimum of its two arguments, that is, $\otimes(a, b) = (\max(a, b), \min(a, b))$.

LEMMA 10.5.5 Consider a comparator-based merging network that merges two sorted lists of $n$
distinct elements $x = (x_1, x_2, \ldots, x_n)$ ($x_i \le x_{i+1}$) and $y = (y_1, y_2, \ldots, y_n)$ ($y_i \le y_{i+1}$)
to produce the sorted list $z = (z_1, z_2, \ldots, z_{2n})$ of $2n$ outputs ($z_i \le z_{i+1}$). There must be $r$
vertex-disjoint paths from any $r$ inputs in $x$ to the outputs in $z$ to which they are mapped by the
network.

Proof Working backwards from the r selected outputs, we see that each output exits from 
the comparator elements to which it is attached via a disjoint path, as suggested for three 
outputs in Fig. 10.9. Extending this argument to the remainder of the network establishes 
the result. ■ 

We next show that inputs can be given values to cause a merging network to shift its values 
in a fashion that permits the derivation of a space-time lower bound. 

THEOREM 10.5.6 Any straight-line comparator-based program that merges two sorted lists of $n$
elements requires space $S$ and time $T$ satisfying

$ST = \Omega(n^2)$

This lower bound can be achieved to within a constant multiplicative factor when $2\log_2 n \le S
\le (n/\log_2 n) + \log_2 n$.

Figure 10.9 Movement of an ordered subset of the items through Batcher's bitonic merge
algorithm.

Proof Let $n$ be divisible by 2. Any consecutive $n/2$ inputs in $x$ can be shifted to the middle
$n/2$ positions in $z$ through a judicious choice of values for $y$. To see this, observe that the
first $k = n - n/4 - l$ components of $y$, $l \le n/2$, can be chosen to be less than the first $l$
components of $x$ with the remaining $n - k$ components of $y$ chosen to be larger than the
first $l + n/2$ components of $x$. This will cause elements in positions $l + 1, l + 2, \ldots, l + n/2$
to shift into positions $n - n/4 + 1, \ldots, n + n/4$.
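The shifting construction can be tried out directly. In the sketch below, `sorted` stands in for the merging network, and the particular values chosen for $y$ (all small values below $x_1$, all large values above $x_n$) are an assumption that satisfies the conditions in the text; the function names are illustrative. The code assumes $n$ is divisible by 4 so that $n/4$ is an integer.

```python
def shifting_y(x, l):
    """Choose a sorted list y so that x[l : l + n/2] lands in the middle n/2 outputs."""
    n = len(x)
    if n % 4 or not 0 <= l <= n // 2:
        raise ValueError("need n divisible by 4 and 0 <= l <= n/2")
    k = n - n // 4 - l                              # number of y values placed below x
    small = [x[0] - k + i for i in range(k)]        # k increasing values below x[0]
    large = [x[-1] + 1 + i for i in range(n - k)]   # n - k increasing values above x[-1]
    return small + large

def merge(x, y):
    """Stand-in for a comparator-based merging network."""
    return sorted(x + y)
```

With 0-based indexing, the middle outputs are `z[3n/4 : 5n/4]`, so for `x = [10, 20, ..., 80]` and any `l` between 0 and 4 the slice `merge(x, shifting_y(x, l))[6:10]` equals `x[l:l + 4]`.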

Since coalescing vertices in a graph reduces neither the time nor space needed to pebble
it, coalesce input vertices assigned to $x$ whose indices are equivalent modulo $n/2$. By
Lemma 10.5.5, the new graph has $n/2$ vertex-disjoint paths between the new inputs and the
$n/2$ outputs in positions $l + 1, l + 2, \ldots, l + n/2$ for each of the $n/2$ cyclic permutations.
It follows that the argument applied to the cyclic shifting function (Lemma 10.5.2) applies
to this function. Thus, the merging network computes a function containing a subfunction
that is $(2, n/2, n/2, n/4)$-independent. The lower bound follows from Corollary 10.4.1.

As shown in Section 6.8, the graph of Batcher's bitonic merging network is an FFT 
graph. Thus, the upper bounds given in Theorem 10.5.5 apply. ■ 



10.6 Worst-Case Tradeoffs for Pebble Games* 

In this section we show that degree-$d$ graphs on $n$ vertices can be pebbled with $O(n/\log n)$
pebbles (Theorem 10.7.1) and that some graphs require this many (Theorem 10.8.1). These
results do not answer the question of how bad the space-time tradeoff can be for an arbitrary
graph. To address this question we must make it precise. Lengauer and Tarjan [197] state it
as follows: is there a value for the space $S$, say, $S_J(n)$, such that for positive constants $c_1(d)$
and $c_2(d)$ if $S \le c_1(d)S_J(n)$, some graph on $n$ vertices requires time superpolynomial in
$n$ to pebble it, whereas for $S \ge c_2(d)S_J(n)$ all graphs on $n$ vertices can be pebbled with a
polynomial number of steps? They show that there is such a jump value for space and that
$S_J(n) = \Theta(n/\log\log n)$. Since all graphs on $n$ vertices can be pebbled with $O(n/\log n)$
space, their result shows there exist graphs on $n$ vertices that require time exponential in $n$
when pebbled with this number of pebbles.

10.7 Upper Bounds on Space* 

We establish upper bounds on space for the class $\mathcal{G}(n,d)$ of directed acyclic graphs on $n$
vertices that have maximum in-degree $d$ and out-degree 2. We limit the out-degree to 2
because many straight-line programs with fan-out k > 2 (and their associated DAGs) can 
be reorganized so that each computation with fan-out k can be replaced by a binary tree of 
replicating subcomputations in which edges are directed from the root to the leaves. This at 
most doubles the number of vertices in the graph. (See Problem 10.12.) 

THEOREM 10.7.1 Let $\mathcal{G}(n,d)$ be graphs with $n$ vertices, in-degree $d$, and out-degree 2 for $d$
fixed. Then $S_{\min}(n,d)$, the minimum space needed to pebble any DAG in $\mathcal{G}(n,d)$, satisfies
$S_{\min}(n,d) = O(n/\log n)$.

Proof Let $E_{\min}(p,d)$ be the minimum number of edges in any graph in $\mathcal{G}(n,d)$ that requires
$p$ pebbles in the pebble game. We show that $E_{\min}(p,d) \ge cp\log_2 p$ for some
constant $c > 0$. From this it follows that

$p \le 2(E_{\min}(p,d)/c)/\log_2(E_{\min}(p,d)/c)$

when $p \ge 2$ and $E_{\min}(p,d) \ge 2c$. (See Problem 10.3.)

Consider a graph $G = (V, E)$ in $\mathcal{G}(n,d)$ with $|E|$ edges. The number of edges incident
on vertices is $2|E|$. Since each vertex has at most $d + 2$ incident edges, $2|E| \le (d + 2)|V|
= (d + 2)n$. The upper bound on the number of pebbles, $p$, follows from this fact and the
previous discussion.

Let $G = (V, E)$ in $\mathcal{G}(n,d)$ require $p$ pebbles. An edge in $E$ is a pair of vertices $(u, v)$.
Let $V_1 \subseteq V$ be vertices that can be pebbled with $p/2$ or fewer pebbles. Let $V_2 = V - V_1$.
Thus, every vertex in $V_2$ requires more than $p/2$ pebbles. Let $E_i$, $i = 1, 2$, be the set of
edges both of whose endpoints are in $V_i$. Let $G_i = (V_i, E_i)$. Let $A = E - (E_1 \cup E_2)$; that
is, $A$ is the set of edges joining vertices in $V_1$ and $V_2$.

We now show that there exists a vertex in $G_2$ that requires more than $p/2 - d$ pebbles
if the pebble game is played on $G_2$ only. Suppose not. Then we show that every vertex in $G$
can be pebbled with fewer than $p$ pebbles. Certainly every vertex in $V_1$ can be pebbled with
fewer than $p$ pebbles. Consider vertices in $V_2$. We show they can be pebbled with fewer than
$p$ pebbles, thereby establishing a contradiction.

Let $v \in V_2$ be pebbled with $p/2 - d$ or fewer pebbles when $G_2$ alone is pebbled. In
pebbling $v$ as part of the complete graph $G$, we may need to pebble a vertex $\omega \in V_2$ some of
whose immediate predecessors are in $V_1$. As we encounter such vertices $\omega$, advance a pebble
to each of $\omega$'s predecessors in $V_1$ one at a time until all predecessors of $\omega$ are pebbled. After
pebbling a predecessor in $V_1$, remove pebbles in $V_1$ not on such predecessors. When all
of $\omega$'s predecessors in $V_1$ have been pebbled, pebble $\omega$ itself using one of the $p/2 - d$ or
fewer pebbles reserved for pebbling on $V_2$. This strategy uses at most $p/2 + d - 1$ pebbles
on vertices in $V_1$, at most $d - 1$ for all but the last predecessor in $V_1$ and at most $p/2$
for the last such predecessor, and at most $p/2 - d$ pebbles on vertices in $V_2$, for a total of
at most $p - 1$. This is a contradiction. It follows that $G_2$ requires at least $p/2 - d + 1$
pebbles when pebbled alone and must have at least $E_{\min}(p/2 - d + 1, d)$ edges. Note that
$E_{\min}(p/2 - d + 1, d) \ge E_{\min}(p/2 - d, d)$.

There is also some vertex in $G_1$ that requires at least $p/2 - d$ pebbles, as we show. By
assumption every vertex in $V_1$ must be pebbled. Suppose that each can be pebbled with
$p/2 - d - 1$ pebbles. There must be a vertex $r_1$ in $V_2$ all of whose predecessors are in
$V_1$. (If not, we can always move backward from a vertex in $V_2$ to one of its immediate
predecessors in $V_2$, a process that must terminate since the finite acyclic graph does not have
a cycle.) Thus, the vertex $r_1$ can be pebbled with $p/2 - 1$ pebbles using the pebbling strategy
described in the preceding paragraph for $\omega$, contradicting the definition of $V_2$. It follows
that $G_1$ must have at least $E_{\min}(p/2 - d, d)$ edges.

Consider now the set of edges $A$ connecting vertices in $V_1$ and $V_2$. If $|A| \ge p/4$,
$E_{\min}(p,d) \ge 2E_{\min}(p/2 - d, d) + |A|$ because both $G_1$ and $G_2$ have $E_{\min}(p/2 - d, d)$
edges. If $|A| < p/4$, pebbles can be placed on the endpoints of edges of $A$ in $V_1$ using at
most $p/2 + p/4 - 1 < 3p/4$ pebbles, with the strategy for $\omega$ given above. If we leave at
most $p/4$ pebbles on these vertices, $3p/4$ pebbles are available to pebble the vertices in $V_2$.
If $V_2$ does not require at least $3p/4$ pebbles, we have a contradiction to the assumption that
$p$ pebbles are needed. Thus, there must be an output vertex $\mu$ that requires at least $3p/4$
pebbles, for if not, none of its predecessors can require more.

We show that a graph requiring at least $3p/4$ pebbles has a subgraph with at least $p/(4d)$
fewer edges that requires at least $p/2$ pebbles. To see this, observe that some predecessor of
the output vertex $\mu$ requires at least $3p/4 - d$ pebbles. Delete $\mu$ and all its incoming edges
to produce a subgraph with at least one fewer edge requiring at least $3p/4 - d$ pebbles.
Repeat this process $p/(4d)$ times to produce the desired result. It follows that $G_2$ has at least
$E_{\min}(p/2, d) + p/(4d)$ edges.

Thus, when either $|A| \ge p/4$ or $|A| < p/4$, at least $2E_{\min}(p/2 - d, d) + p/(4d)$ edges
are required, and

$E_{\min}(p,d) \ge 2E_{\min}(p/2 - d, d) + \frac{p}{4d}$

The solution to this recurrence is $E_{\min}(p,d) \ge cp\log p$ for some constant $c > 1/(8d)$ and a
sufficiently large value of $p$. ■
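To get a feel for the growth of this recurrence, it can be unrolled numerically. The sketch below treats the inequality as an equality and assumes the base case $E_{\min}(p,d) = 0$ once $p \le 4d$; both choices are assumptions made only for illustration and give a value that the true $E_{\min}$ can only exceed.

```python
from math import log2

def edge_lower_bound(p, d):
    """Unroll E(p) = 2 E(p/2 - d) + p/(4d), with the assumed base case E(p) = 0 for p <= 4d."""
    if p <= 4 * d:
        return 0.0
    return 2 * edge_lower_bound(p / 2 - d, d) + p / (4 * d)

def ratio_to_p_log_p(p, d):
    """Ratio of the unrolled value to p log2 p, to compare against the constant 1/(8d)."""
    return edge_lower_bound(p, d) / (p * log2(p))
```

For large $p$ the ratio can be compared with $1/(8d)$, for example with `ratio_to_p_log_p(2**20, 2)`.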



10.8 Lower Bound on Space for General Graphs* 

Now that we have established that every graph in $\mathcal{G}(n,d)$ can be pebbled with $O(n/\log n)$
pebbles, we show that for all $n$ there exists a graph $G(n)$ in $\mathcal{G}(n,d)$ whose minimum space
requirement is at least $c_5 n/\log n$ for some constant $c_5 > 0$.

The graph $G(n)$ is obtained from a recursively constructed graph $H(k)$ on $2^k$ inputs and
$2^k$ outputs, $n/2 \le 2^k \le n$, by adding $n - 2^k$ vertices and no edges. The graph $H(k)$ is
composed of two copies of $H(k-1)$ and two copies of an $n$-superconcentrator, which is
defined below.

DEFINITION 10.8.1 An $n$-superconcentrator is a directed acyclic graph $G = (V,E)$ with $n$
input vertices and $n$ output vertices and the property that for any $r$ inputs and any $r$ outputs,
$1 \le r \le n$, there are $r$ vertex-disjoint paths in $G$ connecting these inputs and outputs. (Paths are
vertex-disjoint if they have no vertices in common.)
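For small graphs the superconcentrator property can be tested by brute force. The sketch below counts vertex-disjoint paths with a unit-capacity maximum flow in which every vertex is split into an "in" and an "out" copy (the standard reduction behind Menger's theorem); the representation of a graph as a list of directed edges and all function names are assumptions made for illustration.

```python
from itertools import combinations
from collections import deque

def max_vertex_disjoint_paths(edges, sources, sinks):
    """Maximum number of vertex-disjoint directed paths from sources to sinks."""
    cap, adj = {}, {}

    def add(u, v):
        cap[(u, v)] = cap.get((u, v), 0) + 1
        cap.setdefault((v, u), 0)            # residual (reverse) arc
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    vertices = {u for e in edges for u in e} | set(sources) | set(sinks)
    for v in vertices:
        add((v, "in"), (v, "out"))           # capacity 1: each vertex used at most once
    for u, v in edges:
        add((u, "out"), (v, "in"))
    for s in sources:
        add("S", (s, "in"))
    for t in sinks:
        add((t, "out"), "T")

    flow = 0
    while True:
        parent = {"S": None}                 # breadth-first search for an augmenting path
        queue = deque(["S"])
        while queue and "T" not in parent:
            u = queue.popleft()
            for v in adj.get(u, ()):
                if v not in parent and cap[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if "T" not in parent:
            return flow
        v = "T"
        while parent[v] is not None:         # push one unit of flow along the path
            u = parent[v]
            cap[(u, v)] -= 1
            cap[(v, u)] += 1
            v = u
        flow += 1

def is_superconcentrator(edges, inputs, outputs):
    """Check Definition 10.8.1 for every r and every choice of r inputs and r outputs."""
    for r in range(1, len(inputs) + 1):
        for ins in combinations(inputs, r):
            for outs in combinations(outputs, r):
                if max_vertex_disjoint_paths(edges, ins, outs) < r:
                    return False
    return True
```

The check is exponential in $n$, so it is only useful for toy examples such as a complete bipartite graph (which has the property) or two inputs funneled through a single middle vertex (which does not).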

For $n = 2^k$ Valiant [343] has shown the existence of $n$-superconcentrators $SC(k)$ that
have $2^k$ inputs, $2^k$ outputs, and $c2^k$ edges. Since his graphs have in-degree greater than 2,
replace vertices with in-degree $d > 2$ with binary trees of $d$ leaves, thereby at most doubling
the size of the graph. (See Problem 10.12.) This provides the following result.

LEMMA 10.8.1 For some constant $c > 0$ and each integer $k$ and $n = 2^k$ there exists an
$n$-superconcentrator $SC(k)$ with $c2^k$ vertices.

We let $H(8) = SC(8)$. For $k \ge 8$ we construct $H(k+1)$ recursively from two copies
of $H(k)$, two copies of $SC(k)$, and extra edges, as suggested in Fig. 10.10. Here edges are
directed from left to right. The $2^k$ output vertices of the first (leftmost) copy of $SC(k)$ (called
$SC_1(k)$) are identified with the $2^k$ input vertices of the first copy of $H(k)$ (called $H_1(k)$),
the $2^k$ output vertices of $H_1(k)$ are identified with the $2^k$ input vertices of the second copy
of $H(k)$ (called $H_2(k)$), and the $2^k$ output vertices of $H_2(k)$ are identified with the $2^k$ input
vertices of the second copy of $SC(k)$ (called $SC_2(k)$). In addition, we introduce $2^{k+1}$ new
input vertices and $2^{k+1}$ new output vertices. The first (topmost) half of the new inputs (called
$I_t$) are connected via individual edges to the inputs of $SC_1(k)$. The second (bottommost) half
of the new inputs (called $I_b$) are also connected via individual edges to the inputs of $SC_1(k)$.
The new inputs are connected individually to the new outputs. Finally, each output of $SC_2(k)$
is connected via individual edges to two new output vertices, one each in the top (called $O_t$)
and bottom half (called $O_b$) of the new outputs.



Figure 10.10 A graph $H(k+1)$ requiring large minimum space.

The graph $H(k)$ has $n(k) = |H(k)|$ vertices, where $n(k)$ satisfies the following:

$n(8) = c2^8$
$n(k+1) = 2n(k) + (2c+4)2^k$

The solution to the recurrence is $n(k) = (k-7)c2^k + (k-8)2^{k+1}$, as can be shown directly.
The graph $H(k)$ is in $\mathcal{G}(n(k), 2)$.
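The closed form can be compared with the recurrence mechanically. This is a small sketch in exact integer arithmetic; the function names and the sample values of the constant $c$ are assumptions made for illustration.

```python
def n_recursive(k, c):
    """Iterate n(8) = c 2^8, n(j + 1) = 2 n(j) + (2c + 4) 2^j up to j = k."""
    size = c * 2**8
    for j in range(8, k):
        size = 2 * size + (2 * c + 4) * 2**j
    return size

def n_closed_form(k, c):
    """Closed form n(k) = (k - 7) c 2^k + (k - 8) 2^(k + 1)."""
    return (k - 7) * c * 2**k + (k - 8) * 2**(k + 1)
```

Both functions agree for every $k \ge 8$, which is what substituting the closed form into the recurrence shows directly.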

Important subgraphs of H (k + 1) have the superconcentrator property, as we now show. 
This result is applied in the subsequent lemma to derive bounds on the amount of space used 
to pebble outputs of H(k + 1). 

LEMMA 10.8.2 The subgraphs of $H(k+1)$ on $2^k$ inputs and $2^k$ outputs defined by vertices and
edges on paths from either inputs in $I_t$ or inputs in $I_b$ to the outputs of $SC_1(k)$ and $H_1(k)$ have the
$2^k$-superconcentrator property.

Proof The superconcentrator property applies to the outputs of $SC_1(k)$ by definition. Note
that the $j$th input of $H_1(k)$ is connected to its $j$th output by an individual edge for $1 \le j \le
2^k$. Thus, any $r$ outputs of $H_1(k)$ have vertex-disjoint paths to the corresponding inputs of
$H_1(k)$. By the superconcentrator property of $SC_1(k)$, there are vertex-disjoint paths from
these outputs of $SC_1(k)$ to any $r$ of its inputs. These statements obviously apply to inputs
in $I_t$ and $I_b$. ■

Our goal is to show that pebbling the graph H(k) requires a number of pebbles propor- 
tional to n(k) / log n(k) . To do this we establish the following stronger condition, which 
implies the desired result. 

LEMMA 10.8.3 Let $c_1 = 14/256$, $c_2 = 3/256$, $c_3 = 34/256$, and $c_4 = 1/256$. To pebble at
least $c_1 2^k$ outputs of $H(k)$ in any order from an initial placement of at most $c_2 2^k$ pebbles requires
there be a time interval $[t_1, t_2]$ during which at least $c_3 2^k$ inputs are pebbled and at least $c_4 2^k$
pebbles remain on the graph.

Proof The proof is by induction on $k$ with $k = 8$ as the base case. For the base case,
consider pebbling $c_1 2^8 = 14$ outputs during a time interval $[0, t]$ from an initial placement
of no more than $c_2 2^8 = 3$ pebbles.

By Problem 10.27 any four outputs of $SC(8)$ are connected via pebble-free paths to
$256 - 3 = 253$ inputs. At least one of these four outputs, say $v$, has pebble-free paths to 64
$= \lceil 253/4 \rceil$ inputs. Let $t_1 - 1$ be the last time at which all 64 of these inputs have pebble-free
paths to $v$. Let $t_2$ be the last time at which a pebble is placed on these 64 inputs. During the
time interval $[t_1, t_2]$ at least $64 \ge c_3 2^8$ inputs are pebbled and at least one pebble remains
on the graph; that is, at least $c_4 2^8$ pebbles remain. This establishes the base case.

Now assume the conditions of the lemma (our inductive hypothesis) hold for $k$. We
show they hold for $k + 1$. Assume that at least $c_1 2^{k+1}$ outputs of $H(k+1)$ are pebbled in
any order from an initial placement of at most $c_2 2^{k+1}$ pebbles during a time interval $[t_a, t_b]$.

We consider four cases including the following two cases. There is an interval $[t_1, t_2] \subseteq
[t_a, t_b]$ during which at least $c_2 2^k$ pebbles are always on the graph and at least $c_3 2^k$ outputs
of either (1) $SC_1(k)$, or (2) $H_1(k)$ are pebbled. By Lemma 10.8.2 the subgraph of $H(k+1)$
consisting of paths from $I_t$ (and $I_b$) to the outputs of each of these graphs constitutes a
$2^k$-superconcentrator. This is the only fact about these two cases that we use. Without loss of
generality, we show the hypothesis holds for the first of them.

The graph consisting of paths from inputs in $I_t$ to the outputs of $SC_1(k)$ constitutes a
$2^k$-superconcentrator. Prior to time $t_a$ there are at most $c_2 2^{k+1}$ pebbles on the graph and
during the interval $[t_1, t_2]$ there are at least $c_2 2^k$ (but at most $c_2 2^{k+1}$) pebbles on the graph.
Thus, there is a latest time $t_0$ before $t_1$ when there are at most $c_2 2^{k+1}$ pebbles on the graph.
Since $c_3 2^k \ge c_2 2^{k+1} + 1$ outputs of $SC_1(k)$ are pebbled in the interval $[t_1, t_2]$ (and in
the interval $[t_0, t_2]$), by Problem 10.27 at time $t_0$ there are at least $2^k - c_2 2^{k+1} \ge c_3 2^k$
inputs in $I_t$ (and in $I_b$) that are connected by pebble-free paths to the pebbled outputs of
$SC_1(k)$. Thus, at least $c_3 2^{k+1}$ inputs in $I_t$ and $I_b$ are connected via pebble-free paths to the
pebbled outputs of $SC_1(k)$. In $[t_0, t_1 - 1]$ there are at least $c_2 2^{k+1}$ pebbles continuously
on the graph, whereas there are at least $c_2 2^k$ pebbles during $[t_1, t_2]$. Since $c_2 2^k \ge c_4 2^{k+1}$,
the number continuously on the graph in $[t_1, t_2]$ is at least $c_4 2^{k+1}$ and we have the desired
conclusion for $H(k+1)$.

In the third case, there is an interval $[t_1, t_2] \subseteq [t_a, t_b]$ during which at least $c_1 2^k$ outputs
of the full graph $H(k+1)$ are pebbled and at least $c_2 2^k$ pebbles are always on the graph. This
implies that during $[t_1, t_2]$ either $c_1 2^k/2$ outputs in $O_t$ or in $O_b$ are pebbled, which in turn
implies that at least $c_1 2^k/2$ outputs of $SC_2(k)$ are pebbled. Since $c_1 2^k/2 \ge c_2 2^{k+1} + 1$
(at most $c_2 2^{k+1}$ pebbles are on $H(k+1)$), it follows from Problem 10.27 that at least
$2^k - c_2 2^{k+1} \ge c_3 2^k$ inputs in $I_t$ (or $I_b$) are connected via pebble-free paths to the pebbled
outputs of $SC_2(k)$. The total number of such inputs is $c_3 2^{k+1}$. Since $c_2 2^k \ge c_4 2^{k+1}$, there
are at least $c_4 2^{k+1}$ pebbles on the graph continuously during $[t_1, t_2]$ and we have the desired
conclusion.

In the fourth case none of the previous cases hold. Since $c_1 2^{k+1}$ outputs of $H(k+1)$
are pebbled during $[t_a, t_b]$, there is an earliest time $t_1 \in [t_a, t_b]$ such that $c_1 2^k$ outputs of
$H(k+1)$ are pebbled in the interval $[t_a, t_1 - 1]$. Since the third case does not hold, there
is a time $t_2 \le t_1$ such that fewer than $c_2 2^k$ pebbles are on the graph at $t_2 - 1$ and at least
$c_1 2^k$ outputs of $H(k+1)$ are pebbled in the interval $[t_2, t_b]$. It follows that at least $c_1 2^k/2$
outputs of $SC_2(k)$ are pebbled during this interval. Since $c_1 2^k/2 \ge c_2 2^k + 1$, it follows
from Problem 10.27 that at least $2^k - c_2 2^k \ge c_3 2^k$ inputs to $SC_2(k)$ (which are outputs to
$H_2(k)$) are connected via pebble-free paths to the pebbled outputs of $SC_2(k)$ and must be
pebbled during $[t_2, t_b]$. Since $c_3 2^k \ge c_1 2^k$, by the inductive hypothesis there is an interval
$[t_d, t_e] \subseteq [t_2, t_b]$ during which at least $c_3 2^k$ inputs of $H_2(k)$ (which are outputs of $H_1(k)$)
are pebbled and $c_4 2^k$ pebbles reside continuously on $H_2(k)$.

Since the second case does not hold, by an argument paralleling that given in the preceding
paragraph there must be a time $t_3 \in [t_d, t_e]$ such that at most $c_3 2^k/2$ outputs of
$H_1(k)$ are pebbled during $[t_d, t_3 - 1]$ and fewer than $c_2 2^k$ pebbles reside on $H(k+1)$ at
$t_3 - 1$. Thus, during $[t_3, t_e]$ at least $c_3 2^k/2 \ge c_1 2^k$ outputs of $H_1(k)$ are pebbled from
an initial configuration of fewer than $c_2 2^k$ pebbles. By the inductive hypothesis there is an
interval $[t_f, t_g] \subseteq [t_3, t_e]$ during which at least $c_3 2^k$ inputs of $H_1(k)$ (which are outputs of
$SC_1(k)$) are pebbled and $c_4 2^k$ pebbles reside on $H_1(k)$ continuously.

Since the first case does not hold, again paralleling an earlier argument there must be a
time $t_4 \in [t_f, t_g]$ such that at most $c_3 2^k/2$ outputs of $SC_1(k)$ are pebbled during $[t_f, t_4 - 1]$
and fewer than $c_2 2^k$ pebbles reside on $H(k+1)$ at $t_4 - 1$. Thus, during $[t_4, t_g]$ at least
$c_3 2^k/2 \ge c_2 2^k + 1$ outputs of $SC_1(k)$ are pebbled from an initial configuration of fewer
than $c_2 2^k$ pebbles. By Problem 10.27 at least $2^k - c_2 2^k \ge c_3 2^k$ inputs of $SC_1(k)$ are
connected via pebble-free paths to the pebbled outputs. Thus at least $c_3 2^k$ corresponding
inputs in both $I_t$ and $I_b$ must be pebbled for a total of at least $c_3 2^{k+1}$ inputs.

Since at least $c_4 2^k$ pebbles reside continuously on both $H_2(k)$ during $[t_d, t_e]$ and on
$H_1(k)$ during $[t_f, t_g]$ and $[t_f, t_g] \subseteq [t_d, t_e]$, it follows that $c_4 2^k + c_4 2^k = c_4 2^{k+1}$ reside
continuously on $H(k+1)$ during $[t_f, t_g]$. ■
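The numerical inequalities among the constants of Lemma 10.8.3 can be verified with exact rational arithmetic. The list of inequalities below is collected from the case analysis as read here, so it should be treated as an assumption about which conditions the proof relies on; the function name is illustrative.

```python
from fractions import Fraction

C1, C2, C3, C4 = (Fraction(v, 256) for v in (14, 3, 34, 1))

def constants_consistent(k):
    """Check the inequalities used in the inductive step for a given k."""
    two_k = 2**k
    checks = [
        C3 * two_k >= C2 * 2 * two_k + 1,        # first and second cases
        two_k - C2 * 2 * two_k >= C3 * two_k,    # enough inputs reachable by pebble-free paths
        C2 * two_k >= C4 * 2 * two_k,            # pebbles remaining on H(k + 1)
        C1 * two_k / 2 >= C2 * 2 * two_k + 1,    # third case
        C3 * two_k / 2 >= C1 * two_k,            # fourth case, applying the hypothesis to H_1(k)
        C3 * two_k / 2 >= C2 * two_k + 1,        # fourth case, outputs of SC_1(k)
    ]
    return all(checks)
```

The third-case inequality holds with equality at $k = 8$ and fails for $k = 7$, which is consistent with the induction starting at $k = 8$.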

We are now ready to show the existence of a graph on $n$ vertices that requires $\Theta(n/\log n)$
minimal space.

THEOREM 10.8.1 For integers $n \ge 1$ there exists a graph $G(n)$ in $\mathcal{G}(n,d)$ that requires
minimum space $S_{\min}(G(n)) \ge c_5 n/\log n$ for some constant $c_5 > 0$.

Proof For $n \ge 2^8$, let $k$ be the largest integer such that $n(k) \le n$; that is, $n(k) \le n <
n(k+1)$. Construct the graph $G(n)$ by adding $n - n(k)$ vertices and no edges to the graph
$H(k)$. An optimal pebbling strategy for $G(n)$ pebbles the added vertices one at a time using
one pebble, after which $H(k)$ is pebbled. From Lemma 10.8.3 it follows that pebbling
$H(k)$ requires at least $c_4 2^k$ pebbles, since at least this many must reside on the graph at one
time. Since $n(k+1) \le 4n(k)$ for $k \ge 8$ and $c \ge 2$, it follows that $n/4 < n(k) \le n$. This
implies that $2^k \le n$ and $k \le \log_2 n$ and that $n/4 \le k(c+2)2^k \le (\log_2 n)(c+2)2^k$.
From this we have $2^k \ge c_5 n/\log_2 n$, where $c_5 = 1/(4c+8)$. The conclusion follows by
observing that at least $(c_4 c_5)n/\log_2 n$ pebbles are needed to pebble $G(n)$. ■



10.9 Branching Programs 



The general branching program is a serial computational model that permits data-dependent 
computation, unlike the pebble game. A branching program is a directed graph consisting of 
a single starting vertex and in which vertices are labeled with predicates. Each vertex has one 
outgoing edge for each value of its predicate. (See, for example, Figs. 10.11 and 10.12.) Time 
in this model is the number of queries performed, and computations other than queries are 
not counted. The space used by a branching program is the base-2 logarithm of the number 
of vertices in its graph. Lower bounds on space and input time obtained with the branching 
program apply to within constant multiplicative factors to the pebble game and the RAM 
model. (See Section 10.9.1.) 

As noted in Section 10.1.1, since the branching program reads inputs in a less constrained 
manner than the straight-line program, it may be possible to solve some problems with branch- 
ing programs using less space or time than in the pebble game. As a consequence, space-time 
lower bounds for branching programs may be smaller than for the pebble game. Thus, if a 
problem is going to be solved with straight-line programs, such as an algebraic circuit, it is bet- 
ter to use lower bounds derived with the pebble game unless the branching program gives the 
same lower bounds. In particular, branching programs give smaller space-time lower bounds 
for integer multiplication and shifting (see Section 10.13.2) than does the pebble game. 

We examine two kinds of branching programs in this section, general branching programs 
and decision branching programs. 

DEFINITION 10.9.1 A multigraph is a graph that may have more than one edge between two
vertices. A directed multigraph is a multigraph in which each edge has a direction. A directed
acyclic multigraph (DAM) is a directed multigraph with no directed cycles. A rooted directed acyclic
multigraph is a multigraph with a root vertex, a vertex with no edges directed into it, and is such
that every vertex can be reached via some path from the root. A sink vertex has no edges directed
away from it.

A branching program $\mathcal{P}$ with input variables $x$ over the set $\mathcal{A}$ and output variables $y$ over
the set $\mathcal{F}$ is a rooted directed acyclic multigraph that has a query $q(x)$ associated with each vertex
except for sink vertices and has a query outcome associated with every edge directed away from a
vertex. Each edge may also carry as a label the values of some output variables, with the proviso that
each output variable is assigned exactly one value along any one path from the root to a sink vertex.

The decision branching program is a special kind of branching program in which the
queries $q(x)$ compare two variables and produce either the two outcomes $\{<, >\}$ or the three
outcomes $\{<, =, >\}$. Figure 10.11 shows an example of a decision branching program that
merges two 2-element sorted lists $(u_1, u_2)$ and $(v_1, v_2)$ ($u_1 \le u_2$ and $v_1 \le v_2$) by using
queries that compare the values of two input variables. Each vertex in the example has two
out-directed edges corresponding to the results of the query. The outputs appear in sorted
order along a path from the root to a leaf.

A decision tree is a decision branching program whose DAM (directed acyclic multigraph) 
is a tree. A decision tree may be constructed for a sequential comparison-based sorting algo- 
rithm, such as Batcher's odd-even merging algorithm of Section 6.8, by associating the first 
comparison with the root, the second comparisons with the roots of the left and right subtrees, 
etc. 

DEFINITION 10.9.2 A computation on a branching program $\mathcal{P}$ is a traversal of the unique
path in the DAM from the root to a leaf determined by the values of the input variables in $x =
(x_1, x_2, \ldots, x_n)$ over the set $\mathcal{A}$. The output of the computation is the sequence of output values
in $y = (y_1, y_2, \ldots, y_m)$ over the set $\mathcal{F}$ encountered on the edges of the path traversed.

A function $f^{(n)} : \mathcal{A}^n \mapsto \mathcal{F}^m$ with input variables in $x$ and output variables in $y$, namely

$f^{(n)}(x_1, x_2, \ldots, x_n) = (y_1, y_2, \ldots, y_m)$

is computed by $\mathcal{P}$ if for each value of $x$ the correct value of each output variable appears exactly
once on each path from the root to a leaf.

Figure 10.11 A decision branching program that merges the lists $(u_1, u_2)$ and $(v_1, v_2)$ when
$u_1 \le u_2$ and $v_1 \le v_2$.

The time associated with a computation is the length of the path traversed by the computa- 
tion. The computation time T of a branching program is the length of its longest path. 

In Fig. 10.11 the computation associated with the input values $(u_1, u_2, v_1, v_2) = (2, 4, 1,
3)$ takes the right branch out of the root and produces the output value $v_1 = 1$, takes the left
branch at the next vertex and produces $u_1 = 2$, and takes the right branch at the last vertex
and produces $v_2 = 3$ and $u_2 = 4$. The output of this computation is the sorted sequence
1, 2, 3, 4, as expected. This branching program merges the two sorted lists. Each sink vertex
corresponds to one of the four ways of merging the two lists. The computation time of this
branching program is 3.
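A decision branching program of this kind can be written out as nested comparisons. The sketch below is a decision tree that merges two sorted pairs with at most three queries; it is not claimed to reproduce the exact graph of Fig. 10.11, and the convention that the "left" branch corresponds to the outcome $\le$ is an assumption made for illustration.

```python
def merge_two_pairs(u1, u2, v1, v2):
    """Merge sorted pairs (u1, u2) and (v1, v2); return (outputs, number of queries)."""
    out, queries = [], 0
    queries += 1
    if u1 <= v1:                      # left branch at the root: u1 is smallest
        out.append(u1)
        queries += 1
        if u2 <= v1:                  # both u values precede both v values
            out += [u2, v1, v2]
        else:
            out.append(v1)
            queries += 1
            out += [u2, v2] if u2 <= v2 else [v2, u2]
    else:                             # right branch at the root: v1 is smallest
        out.append(v1)
        queries += 1
        if u1 <= v2:
            out.append(u1)
            queries += 1
            out += [u2, v2] if u2 <= v2 else [v2, u2]
        else:                         # both v values precede both u values
            out += [v2, u1, u2]
    return out, queries
```

On the input $(2, 4, 1, 3)$ the path goes right, left, right, as in the computation described above, and `merge_two_pairs(2, 4, 1, 3)` returns `([1, 2, 3, 4], 3)`.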

Branching programs that compare elements at vertices are well suited to merging and sort- 
ing but are not of the most general type. 

DEFINITION 10.9.3 A general branching program $\mathcal{P}$ with input variables $x$ over a finite set
$\mathcal{A}$ has a query of the form $x_i = ?$ associated with a variable $x_i$ at each vertex. It also has one edge
directed away from the vertex for each value of $x_i$. A general branching program is non-redundant
if on each path from the root to a leaf a query $x_i = ?$ appears at most once.

The general branching program is also known as a binary decision diagram (BDD). BDDs
are widely used in the computer-aided design (CAD) of circuits for Boolean functions.

A general branching program that convolves two short binary sequences over the integers
is shown in Fig. 10.12. (Convolution is defined in Section 6.7.4.) A computation takes the
left branch of a vertex when the associated variable has value 0 and the right branch when it
has value 1. This branching program computes the convolution $c = a \otimes b$ of the sequences
$a = (a_0, a_1)$ and $b = (b_0, b_1)$; that is,

$c_0 = a_0 b_0$, $c_1 = a_0 b_1 + a_1 b_0$, $c_2 = a_1 b_1$

Figure 10.12 A general branching program to compute the convolution of two sequences
$(a_0, a_1)$ and $(b_0, b_1)$.

The performance of a branching program is also measured by its space complexity. 

DEFINITION 10.9.4 The space used by branching program $\mathcal{P}$ is the base-2 logarithm of the
number of vertices in its directed acyclic multigraph.

As shown in the next section, this definition permits a lower bound on the space complexity 
used by any reasonable general-purpose computer model equipped with a random-access read- 
only memory for its input data. 

The following lemma demonstrates that every decision branching program can be simu- 
lated by a general branching program, thereby showing the latter to be more general than the 
former. (See Problem 10.35.) 

LEMMA 10.9.1 Every decision branching program with variables over a finite set $\mathcal{A}$ with
computation time $T$ and space $S$ can be simulated by a general branching program with computation
time $2T$ and space $S + \log(|\mathcal{A}| + 1)$.

This result is proved by constructing a general branching program to simulate a comparison 
operator and substituting it for the comparison operator in a decision branching program. (See 
Problem 10.35.) The graph that results from this construction is explicitly a multigraph. 

While Lemma 10.9.1 establishes that decision branching programs are no more powerful 
than general branching programs, this does not imply that general branching programs require 
less space. In fact, the space complexity of a given decision branching program is independent 
of the size of the set A over which the variables are defined; this is not true for general branching 
programs. 

If space complexity is not an issue, a tree program can be constructed. This is a branching
program whose DAM is a tree. The following recursive procedure converts a branching
program to a tree program: a) If any immediate descendant of the root has more than one edge
directed into it, make as many copies of the submultigraph rooted at that descendant as there
are entering edges and direct exactly one edge into each. b) Apply this procedure recursively to
each of the submultigraphs until leaf vertices are reached. This procedure does not change the
length of any path in the original DAM or the computation time.
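The recursive procedure can be sketched in Python as follows. This is a minimal illustration under assumptions of our own: the DAM is represented as a dictionary mapping each vertex name to a list of (edge label, successor) pairs, and the names `unfold_to_tree`, `count_vertices`, and `depth` are hypothetical. Copying a shared submultigraph once per entering edge is realized here simply by rebuilding the subtree each time it is reached.

```python
def unfold_to_tree(dam, root):
    """Unfold a directed acyclic multigraph into a tree.

    dam maps a vertex name to a list of (edge_label, successor) pairs; a vertex
    that is absent or maps to an empty list is a leaf. The result is a nested
    tuple (vertex_name, [(edge_label, subtree), ...]).
    """
    return (root, [(label, unfold_to_tree(dam, child))
                   for label, child in dam.get(root, [])])


def count_vertices(tree):
    """Number of vertices in an unfolded tree."""
    _, children = tree
    return 1 + sum(count_vertices(subtree) for _, subtree in children)


def depth(tree):
    """Length (in edges) of the longest root-to-leaf path."""
    _, children = tree
    return 0 if not children else 1 + max(depth(subtree) for _, subtree in children)


# A small DAM with 5 vertices in which the sink "s" has three entering edges.
dam = {
    "r": [(0, "a"), (1, "b")],
    "a": [(0, "s"), (1, "s")],
    "b": [(0, "s"), (1, "t")],
}
tree = unfold_to_tree(dam, "r")
print(count_vertices(tree), depth(tree))
```

In this example the tree has more vertices than the DAM (the shared sink is copied), while the longest path length, and hence the computation time, is unchanged.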

The notions of space and time can be generalized to average time and space when a prob- 
ability distribution is defined on input values. (See Problem 10.37.) 

Below we present a key lemma used to derive lower bounds on the space-time product. 
This lemma is stated for normal-form branching programs, general branching programs 
whose DAMs are level multigraphs, that is, multigraphs in which each vertex has a level and 
adjacent vertices are in adjacent levels. An example of such a graph is shown in Fig. 10.13. 

LEMMA 10.9.2 If there is a general branching program of space S and computation time T for a
function f, then there is a normal-form branching program for f that has space 2S and computation
time T.

Proof To convert a general branching program to a normal-form branching program, create
T + 1 copies of the general branching program, one for each time step including the zeroth.
Delete the original edges and add an edge from vertex u in the ith copy to vertex v in the
(i + 1)st copy if there was an edge between u and v in the original graph. Now delete all
edges and vertices that are not reached from the root of the zeroth branching program. (See
Fig. 10.14.)

This procedure increases the number of vertices by at most a factor of T, thereby
increasing the space by adding at most log T. However, a branching program with space S
has $2^S$ vertices. Thus, the length of the longest path through the program T cannot exceed
$2^S$, or $S + \log T \le 2S$. ■

Figure 10.13 A normal-form tree program for table lookup. It has one path for each value of
the input. (Leaf labels, left to right: 001, 110, 010, 000, 011, 101, 111, 100.)

Generally the space S used for a branching program computation will be large by com- 
parison with log T, in which case the space bounds for normal-form branching programs and 
general branching programs will differ by at most a constant factor. 
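The level construction in the proof of Lemma 10.9.2 can be sketched in Python. The representation (a dictionary from vertex names to lists of (edge label, successor) pairs) and the name `to_normal_form` are assumptions made for illustration. Rather than building all T + 1 copies and pruning afterward, the sketch creates only the copies reachable from the root of the zeroth copy, which yields the same level multigraph.

```python
from collections import deque


def to_normal_form(dam, root, T):
    """Return (vertices, edges) of the leveled multigraph reachable from (0, root).

    A vertex of the result is a pair (level, name); every edge joins level i
    to level i + 1, as required of a level multigraph.
    """
    vertices = {(0, root)}
    edges = []
    frontier = deque([(0, root)])
    while frontier:
        level, u = frontier.popleft()
        if level == T:
            continue  # no copies exist beyond level T
        for label, v in dam.get(u, []):
            target = (level + 1, v)
            edges.append(((level, u), label, target))
            if target not in vertices:
                vertices.add(target)
                frontier.append(target)
    return vertices, edges


# The sink "s" is reachable by paths of length 1 and 2, so it appears on two levels.
example = {"r": [(0, "a"), (1, "s")], "a": [(0, "s"), (1, "s")]}
vertices, edges = to_normal_form(example, "r", T=2)
print(sorted(vertices))
```

The number of (level, name) pairs is at most T + 1 times the number of original vertices, which is the source of the additive log T term in the space bound.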

In the rest of this chapter when we speak of a branching program we mean a general 
branching program. 




Figure 10.14 Construction of a normal-form general branching program as a level multigraph.

We close this section by describing a normal-form tree program for table lookup, an
important programming tool that can be used to compute an arbitrary function
$f^{(n)} : \mathcal{A}^n \mapsto \mathcal{A}^m$ on n variables whose value is an m-tuple. Each of the n
variables is read and the value of the function is found in a table. This is simulated by a
tree program with branching factor $|\mathcal{A}|$ in which the variables are read in succession
until they are all read, at which point the value of the function is provided. An example of
such a tree program for a function $f^{(3)} : \mathcal{B}^3 \mapsto \mathcal{B}^3$ is shown in Fig. 10.13.
There is one path through the tree for each of the possible $|\mathcal{A}|^n$ assignments to the
n inputs. The sink vertices are labeled by the appropriate m-tuple. Such table-lookup tree
programs have computation time n and space proportional to $n \log |\mathcal{A}|$ since they have
$(|\mathcal{A}|^{n+1} - 1)/(|\mathcal{A}| - 1)$ vertices with $|\mathcal{A}|$ edges per vertex except for
those at the lowest level.
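As a rough illustration of table lookup by a tree program, the Python sketch below counts the vertices of the complete tree and follows one root-to-leaf path. The helper names (`tree_vertex_count`, `build_table`, `lookup`) and the sample function, which reverses a 3-bit input, are hypothetical; the function tabulated in Fig. 10.13 is not recoverable from the text and is not reproduced here.

```python
from itertools import product


def tree_vertex_count(alphabet_size, n):
    """Vertices in a complete tree with the given branching factor and n + 1 levels."""
    return sum(alphabet_size ** level for level in range(n + 1))


def build_table(f, alphabet, n):
    """Label each leaf (one per input n-tuple) with the value of f."""
    return {x: f(x) for x in product(alphabet, repeat=n)}


def lookup(table, read_input, n):
    """Read the n variables in order, then emit the value stored at the leaf reached."""
    path = tuple(read_input(i) for i in range(n))
    return table[path]


# Hypothetical example: f reverses a 3-bit input.
table = build_table(lambda x: tuple(reversed(x)), (0, 1), 3)
inputs = (1, 0, 0)
print(lookup(table, lambda i: inputs[i], 3), tree_vertex_count(2, 3))
```

The vertex count agrees with the closed form $(|\mathcal{A}|^{n+1} - 1)/(|\mathcal{A}| - 1)$, and each lookup reads exactly n variables.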

10.9.1 Branching Programs and Other Models 

We begin this section with a comparison of branching programs and pebble games and con- 
clude with a brief comparison of branching programs and the RAM model of computation. 

The pebble model assumes that computation is serial and straight-line. If all algorithms 
used for a particular problem are of this type, the pebble game is the appropriate model, es- 
pecially if the lower bounds on space-time exchanges are larger than those provided by the 
branching program model. (All algorithms used today for integer multiplication are straight- 
line and the lower bounds on the space-time product for this problem are larger with the 
pebble game than with the branching program model.) If the two models give the same lower 
bounds, then we can invoke Lemma 10.9.3 to derive lower bounds on the space-time exchanges
for pebbling from those for branching programs when $\log_2 T_{\mathcal{P}}$ is small by comparison
with $S_{\mathcal{P}}$, where $T_{\mathcal{P}}$ and $S_{\mathcal{P}}$ are the time and space used by the pebbling model.

Data-dependent reading of inputs may allow the branching program to perform a com- 
putation more quickly than the pebbling model. For example, merging requires a space-time 
product that is quadratic in the length of the input strings with the pebble game but only 
linear in the branching program. (See Section 10.10.2.) This demonstrates that the branching 
program is a much more natural model for this problem. 

If the lower bounds derived with the branching program are comparable in strength to 
those offered by the pebbling model, as is true for most of the problems considered in this 
chapter, straight-line programs are the better model for these problems. But the extra flexibility 
offered by branching programs means that when their results are comparable to those provided 
by the pebble game, one must work harder to obtain them. (See Sections 10.11 and 10.12.)

The branching program measures the time to read inputs but ignores the time for com- 
putations and the production of outputs. By contrast, the pebble game measures the time to 
read inputs, perform computations, and produce outputs. Although the time for computations 
generally cannot be ignored, the methods available today to derive lower bounds for both mod- 
els are based on the time spent reading inputs. But while for many problems the time to read 
inputs dominates computation time for many values of space, when space is large the pebbling 
model has the potential to give larger lower bounds than the branching program model. For 
example, no way is known to compute the n-point DFT with fewer than $\Omega(n \log n)$ steps,
the number used by the FFT algorithm, although in the limit of large space the branching
program gives a lower bound on space proportional to n.

To simulate the pebbling of a DAG by a branching program we must give an interpretation
to each vertex of the DAG: assign an operation to each non-input vertex and a variable as
well as values to each input vertex. Two different interpretations of a DAG may yield different
branching programs. Of course, a DAG is pebbled without regard to the interpretation of
vertices: the pebble-game lower bounds use only the fact that vertices can hold one of
$|\mathcal{A}|$ values and do not depend explicitly on the interpretation given to their operator.

LEMMA 10.9.3 Given a pebbling $\mathcal{P}$ of an interpreted directed acyclic graph G that uses
$S_{\mathcal{P}}$ pebbles and $T_{\mathcal{P}}$ input steps to compute a function with operations over a finite
set $\mathcal{A}$, there is a branching program with space $S_{\mathcal{P}} \log |\mathcal{A}| + \log(2T_{\mathcal{P}})$ and
time $T_{\mathcal{P}}$ that computes the function computed by G. Thus, if
$2T_{\mathcal{P}} \le |\mathcal{A}|^{S_{\mathcal{P}}}$, simultaneous lower bounds on the space and time for
a branching program for the function imply simultaneous lower bounds on space and time in the
pebble game that differ by at most constant multiplicative factors.



Proof We construct a branching program Q to simulate the pebbling $\mathcal{P}$ of a directed acyclic
graph that uses $S_{\mathcal{P}}$ pebbles and $T_{\mathcal{P}}$ steps. (Figure 10.15 illustrates the construction
of such a branching program.) Initially the branching program has a single vertex, the root, which
is labeled with the first variable to be pebbled according to $\mathcal{P}$. Advance the first pebble as
far as possible. Create a vertex in the branching program for each value of the operation
or input covered by the first pebble. Label these new vertices with the name of the second
input to be pebbled and attach an edge from the root vertex to these new vertices labeled
with the corresponding value for the first input. Advance pebbles as far as possible according
to $\mathcal{P}$ and create one new vertex in the branching program for each different tuple of values
residing under the pebble(s) currently on the DAG. (In the example of Fig. 10.15, after
placing a pebble on the second vertex we advance a pebble to the third vertex and remove
all other pebbles. Thus, only two vertices are added to the branching program at this step.)
Label the new vertices with the third input to be pebbled. Now repeat the above process
by advancing pebbles as far as possible (in the example, pebbles now reside on the third and
fourth vertices), add one new vertex for each tuple of values under the pebbles on the DAG
(four vertices are added), and connect edges from the previous to the current set of new
vertices that conform to the values assumed at the vertices of the DAG. This process is
repeated until all inputs have been pebbled.

Since the values of operations are always determined by the values under at most $S_{\mathcal{P}}$
pebbles, the number of new vertices added in Q with the pebbling of each new input vertex
in G is at most $|\mathcal{A}|^{S_{\mathcal{P}}}$. Since $T_{\mathcal{P}}$ input vertices of G are pebbled, it follows
that Q has at most $T_{\mathcal{P}}|\mathcal{A}|^{S_{\mathcal{P}}} + 1 \le 2T_{\mathcal{P}}|\mathcal{A}|^{S_{\mathcal{P}}}$ vertices, from
which the conclusion follows. ■

Figure 10.15 A general branching program (b) that simulates the pebbling of a DAG (a) in the
vertex order 1, 2, 4, 3, 5, 6, 7. The DAG input variables are denoted u, v, w, and x and assume
values in {0, 1}. + denotes OR and * denotes AND.

A branching program can also simulate a computation by a general model of computation,
such as the RAM discussed in Section 3.4, as we now show. Let the RAM have M b-bit words
of memory and a finite number of b-bit words in its CPU. Consider any program for such a
machine. Its state is determined by the values in its registers and memory locations. Thus the
RAM has at most $O(2^{Mb})$ states. Let the space used by a RAM be the base-2 logarithm of
the number of its states. Let the RAM execute $T_{\mathrm{RAM}}$ steps to read its inputs. We simulate
this computation in the same fashion as with the pebble game. After reading an input variable,
the branching program enters one of at most $O(2^{Mb})$ vertices corresponding to states of the
RAM. Since the RAM reads inputs on $T_{\mathrm{RAM}}$ steps, the branching program also takes
$T_{\mathrm{RAM}}$ steps and has at most $O(T_{\mathrm{RAM}} 2^{Mb})$ vertices or uses space of at most
$O(Mb + \log T_{\mathrm{RAM}})$. As long as Mb is larger than some multiple of $\log T_{\mathrm{RAM}}$,
simultaneous lower bounds on the time to read inputs and space of a branching program for a
function computed by the RAM serve as lower bounds on the same quantities on the RAM. The
following lemma summarizes this discussion.

LEMMA 10.9.4 Given a RAM program that uses space $S_{\mathrm{RAM}}$ and $T_{\mathrm{RAM}}$ input steps to
compute $f : \mathcal{A}^n \mapsto \mathcal{A}^m$ there is a branching program with space
$O(S_{\mathrm{RAM}} + \log(2T_{\mathrm{RAM}}))$ and time $T_{\mathrm{RAM}}$ that computes f. Thus, if
$2T_{\mathrm{RAM}} \le 2^{Mb}$, simultaneous lower bounds on the space and
time for a branching program for the function imply simultaneous lower bounds on the space and
time on the RAM that differ by at most constant multiplicative factors.



10.10 Straight-Line Versus Branching Programs 

In this section we show that some problems can use space and time more efficiently with
branching programs than they can with the pebble game. We demonstrate this for the cyclic
shifting function $f^{(n)}_{\mathrm{cyclic}} : \mathcal{B}^{n + \lceil \log n \rceil} \mapsto \mathcal{B}^n$ introduced
in Section 2.5.2 and the merging problem introduced in Section 6.8. However, for all of the
other problems studied in this chapter the lower bounds obtained with these two models are the
same up to constant multiplicative factors, except for integer multiplication, where the
branching program bound is smaller by a factor of log n.

It is important to note, however, that the superiority of branching programs arises from
the assumption that inputs can be read in a data-dependent fashion, an assumption that is
not available to straight-line programs. As we know from Problem 10.20, if branching is
allowed but inputs must be read in a data-independent fashion by an input-output-oblivious
finite-state machine, Theorem 10.4.1 applies. Thus, branching programs that read inputs in
a data-independent fashion have no advantage over straight-line programs, at least in terms of
lower bounds on space-time exchanges.



10.10.1 Efficient Branching Programs for Cyclic Shift 

We present a branching program for $f^{(n)}_{\mathrm{cyclic}}$ that uses space $S = O(\log n)$ and
time $T = n + \lceil \log n \rceil$; that is, $ST = O(n \log n)$, a product that is much less than
the $\Omega(n^2)$ product required in the pebble game. (See Section 10.5.2.)

The function $f^{(n)}_{\mathrm{cyclic}}$ has $n + \lceil \log n \rceil$ Boolean variables:
$\lceil \log n \rceil$ control inputs and n "value" inputs whose values are shifted by the amount
specified by the control inputs. Our efficient branching program is a tree program (see
Fig. 10.13) that reads the control inputs and selects one of n paths through the tree. (Note
that $n \le 2^{\lceil \log n \rceil} < 2n$.) Each path corresponds to one of the n possible cyclic
shifts of the n value inputs. Attached to a leaf of this tree is a chain of vertices, one per
value input. These inputs appear in the order specified by the cyclic shift associated with
the path. An input value is read and then produced as output at each of these n vertices.
Since this branching program has at most $2n + n^2$ vertices, it has space $O(\log n)$. It
uses time $n + \lceil \log n \rceil$.
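The behavior of this tree-plus-chain program can be simulated with a short Python sketch. Several details are assumptions made for illustration and are not fixed by the text: control bits are read most significant first, the shift moves values toward lower indices, and a shift amount of n or more is reduced modulo n. The name `cyclic_shift_program` is hypothetical.

```python
def cyclic_shift_program(n, read_control, read_value):
    """Simulate the tree-plus-chain program; return (outputs, number_of_reads).

    read_control(i) returns control bit i and read_value(j) returns value
    input j; each call models one input query of the branching program.
    """
    control_bits = (n - 1).bit_length()  # equals ceil(log2 n) for n >= 1
    reads = 0

    # Tree part: one control bit is queried per level, selecting a path.
    shift = 0
    for i in range(control_bits):
        shift = 2 * shift + read_control(i)
        reads += 1
    shift %= n  # assumption: shift amounts >= n wrap around

    # Chain part: the value inputs are read, and output, in shifted order.
    outputs = []
    for j in range(n):
        outputs.append(read_value((j + shift) % n))
        reads += 1
    return outputs, reads


controls = [0, 1]                  # binary 01: shift by one position
values = ["a", "b", "c", "d"]
print(cyclic_shift_program(4, lambda i: controls[i], lambda j: values[j]))
```

Each value input is read exactly once, after the control inputs are known, which is the data-dependent reading that a straight-line program cannot use.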

If cyclic shifting is to be done by a straight-line program, say in hardware, then it is better to 
use the pebble game for lower bounds since this model applies to logic circuits and the results 
it provides are stronger. However, if the problem is to be executed in software, the branching 
program should be used unless the program is straight-line. 

10.10.2 Efficient Branching Programs for Merging 

Consider now the merging problem. In Section 10.5.6 we show that it requires an $\Omega(n^2)$
space-time product where n is the size of the input. However, when executed by a branching
program it uses space $O(\log n)$ and time $O(n)$, as we show.

Figure 10.11 shows a "pyramid" decision branching program to merge two sequences of 
length two. It is straightforward to extend this decision branching program to sequences of 
length n, as suggested in Fig. 10.16. In this figure vertices are labeled by the number of 
elements that are removed from the two lists being merged before arriving at the vertex carrying 
the label. For example, prior to arriving at the vertex labeled (2, 1), two elements have been 
removed from the left list and one from the right list. We assume that the lists to be merged 
each contain n elements. Thus, all the pyramid vertices below a vertex labeled with $(n, k)$ or
$(k, n)$, $1 \le k \le n - 1$, are deleted because below such vertices no further comparisons are
needed; the outputs produced are those on the list from which k values have been removed.
Thus, we attach a chain of $n - k$ vertices, one for each of the input values at the end of the
smaller list. If the root is at level 1, vertices labeled $(n, k)$ and $(k, n)$ are at level
$n + k + 1 < 2n + 1$.

The number of vertices on level $l$ of this decision branching program is at most $l$. Since
$1 \le l \le 2n + 1$, it has at most $\sum_{l=1}^{2n+1} l = (n + 1)(2n + 1)$ vertices. The space
associated with this program is $O(\log((n + 1)(2n + 1)))$. Since the length of the longest path
in this program is 2n, it has time 2n associated with it. From Lemma 10.9.1 it follows that
merging can be realized by a general branching program with space $O(\log n) + \log |\mathcal{A}|$ and
time $O(n)$ or a space-time product that is $O(n \log n)$, much smaller than the $\Omega(n^2)$
space-time product that applies to the pebble game.

Figure 10.16 The top portion of a decision branching program to merge two sorted lists. The
pair of integers at a vertex denotes the number of elements removed from the left and right lists
by the program before arriving at the vertex carrying the pair.
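The path that one input follows through the pyramid can be traced with a short Python sketch. The function name `merge_with_trace` and the tie-breaking rule (take from the left list on equality) are assumptions for illustration; the pair recorded at each step plays the role of the vertex label (i, j) in Fig. 10.16.

```python
def merge_with_trace(left, right):
    """Merge two sorted lists and record the pyramid vertices (i, j) visited.

    (i, j) counts the elements already removed from the left and right lists.
    """
    i = j = 0
    visited = [(0, 0)]
    merged = []
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:       # one comparison per pyramid vertex
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
        visited.append((i, j))
    # One list is exhausted: the attached chain outputs the rest without comparisons.
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, visited


merged, visited = merge_with_trace([1, 4], [2, 3])
print(merged, visited)
```

The program only needs to remember the current pair (i, j), which takes $O(\log n)$ bits, and the trace makes at most $2n - 1$ comparisons before one list is exhausted.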

10.11 The Borodin-Cook Lower-Bound Method 

In this section we generalize the method of Borodin and Cook [53] for deriving space-time
lower bounds for branching programs. The conditions under which lower bounds can be
derived are captured by a property of functions called $(\phi, \lambda, \mu, \nu, \tau)$-distinguishability,
which is stronger than the flow property used to derive lower bounds on space-time tradeoffs for
the pebble game. In fact, we show that a function that is $(1, \lambda, \mu, \nu, \tau)$-distinguishable is
$(\alpha, n, m, p)$-independent for the appropriate values of $\alpha$, n, m, and p.

DEFINITION 10.11.1 Let $\tau : \mathbb{N} \mapsto \mathbb{N}$ be a nondecreasing function. A function
$f : \mathcal{A}^n \mapsto T^m$ is $(\phi, \lambda, \mu, \nu, \tau)$-distinguishable for
$0 < \phi, \lambda, \mu, \nu \le 1$ if there is a set $V \subseteq \mathcal{A}^n$ satisfying
$|V| \ge \phi|\mathcal{A}|^n$ such that for each assignment to a selection of $a \le \lambda n$ input
variables and each assignment to a selection of $b \le \mu m$ output variables of f,
$a \le \tau(b)$, the number of input n-tuples consistent with the values of the a input variables
that cause f to assume the given values for the b output variables is at most
$|\mathcal{A}|^{n - a - \nu b}$.

The meaning of this property for the function f is suggested by Fig. 10.17. For a fraction
$\phi$ of the input tuples ($\phi = 1$ is the normal case), when any a input variables and any b
output variables of f are assigned values, the maximum number of input n-tuples that cause
f to produce these output values is no more than $|\mathcal{A}|^{n - a - \nu b}$. This property is used
below to derive a lower bound on the space-time product for branching programs. We use
$\phi = 1$ for all problems considered below except for the unique elements problem.

The theorem below also uses a version of the pigeonhole principle. Time is subdivided into
intervals containing equal numbers of input queries. This has the effect of chopping the
branching program up into layers (called stages in the proof). We reason that each input
n-tuple follows a rich path through a layer that contains a large number of outputs. Because of
the distinguishability property, an upper limit on the number of inputs can be associated with
each rich path. It follows that there must be many rich paths or that the branching program
must have a large number of vertices (and space).

Figure 10.17 For a fraction of at least $\phi$ of the input n-tuples, a
$(\phi, \lambda, \mu, \nu, \tau)$-distinguishable function f has an upper limit of
$|\mathcal{A}|^{n - a - \nu b}$ on the number of input n-tuples consistent with an assignment of
values to any a inputs and any b outputs of f when $a \le \lambda n$, $b \le \mu m$, and
$a \le \tau(b)$. (Figure labels: "Input tuples consistent with a fixed inputs and b fixed
outputs" and "Output tuples containing b fixed outputs.")

THEOREM 10.11.1 Let $f : \mathcal{A}^n \mapsto T^m$ be $(\phi, \lambda, \mu, \nu, \tau)$-distinguishable for
$\lambda \le \mu$. Then the space S and time $T \ge n$ required by any general branching program
$\mathcal{P}$ that computes f must satisfy

$$S \ge \frac{m \nu a}{4T} \log_2 |\mathcal{A}| + \frac{1}{2} \log_2 \phi$$

where $a \le \lambda n$ is the largest integer satisfying $a \le \tau(ma/2T)$ and
$n > (\lceil 1/\lambda \rceil - 2)/(1 - \lambda(\lceil 1/\lambda \rceil - 1))$. (Note that
$\log_2 \phi$ is a negative constant.)

Proof We show that $S \ge (m \nu a/2T) \log_2 |\mathcal{A}| + \log_2 \phi$ for normal-form branching
programs and then invoke Lemma 10.9.2 to apply it to a general branching program with space 2S
and time T.

The approach is to break $\mathcal{P}$ into $\sigma = \lceil (T+1)/(a+1) \rceil$ disjoint stages
starting with the root at the zeroth level, each stage of which contains $a + 1$ levels,
$a \le \lambda n$, except possibly for the last, which may have fewer levels.
($\sigma \le 2T/a$ since $T \ge n \ge 1$.) Each stage has depth a. Thus, the last row in one
stage is the first row in the next stage. Each stage except for the first typically has
multiple roots. (Figure 10.18(a) shows a branching program with $T = 5$ levels. Since $a = 2$,
it is divided into $\sigma = \lceil (T+1)/(a+1) \rceil = 2$ layers by the horizontal line.
Internal vertices belong to two layers.)
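The arithmetic behind the stage decomposition can be checked numerically. The Python sketch below assumes only what the proof states: the number of stages is $\lceil (T+1)/(a+1) \rceil$, and some stage must produce at least (number of outputs)/(number of stages) outputs, rounded up. The function names are hypothetical.

```python
def number_of_stages(T, a):
    """Return ceil((T + 1) / (a + 1)) using integer arithmetic."""
    return -(-(T + 1) // (a + 1))


def rich_path_outputs(m, T, a):
    """Outputs that some stage must produce when m outputs are spread over the stages."""
    return -(-m // number_of_stages(T, a))


# The example of Fig. 10.18: T = 5 and a = 2 give 2 stages.
print(number_of_stages(5, 2))

# For 1 <= a <= T the stage count is at most 2T/a, so the rich-path
# output count is at least m*a/(2T).
m = 100
ok = all(
    number_of_stages(T, a) * a <= 2 * T
    and rich_path_outputs(m, T, a) * 2 * T >= m * a
    for T in range(1, 60)
    for a in range(1, T + 1)
)
print(ok)
```

The check covers only a finite range of T and a; it illustrates the inequality rather than proving it.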

Using a modified version of the technique described on page 491 to create a tree program
from a branching program, replace the branching program in each stage by a set of tree
programs of depth a, shown in Fig. 10.18(b). Eliminate redundant queries on each path in
each tree. Also, pad paths that do not have a queries on them with superfluous but
non-redundant queries so that each path through each tree has the same length. A superfluous
query has all of its output edges directed to a single successor vertex. Also, move all tree
outputs down to the leaves of these trees (which are also roots of trees in the next stage). Let
$\mathcal{P}^*$ be the new branching program. Since the roots of trees in each stage are vertices in
the original branching program, there are no more than $2^S$ trees.

Figure 10.18 The transformation of a T-step branching program into a branching program
with $\sigma = \lceil (T+1)/(a+1) \rceil$ layers in which each layer consists of a forest of trees.
Panel (a) shows the original program and panel (b) the transformed program.

Let x be one of the input n-tuples among the fraction $\phi$ for which
$(\phi, \lambda, \mu, \nu, \tau)$-distinguishability is defined. The path through $\mathcal{P}^*$
defined by x passes through $\sigma$ stages. Therefore, there must be at least one stage
containing a tree path that produces at least $b = \lceil m/\sigma \rceil$ outputs (a rich path).
(As shown in the last paragraph of this proof, $b \le \lceil \mu m \rceil$ when
$\lambda \le \mu$ for sufficiently large n.) Thus, x defines at least one rich path. Let
$a \le \tau(b)$. Because the function $f : \mathcal{A}^n \mapsto T^m$ is
$(\phi, \lambda, \mu, \nu, \tau)$-distinguishable, each rich path can be associated with at most
$|\mathcal{A}|^{n - a - \nu b}$ inputs. (This number is smaller if more than b outputs are produced.)
Since there are at most $2^S$ trees and at most $|\mathcal{A}|^a$ paths through each tree, there are
at most $2^S |\mathcal{A}|^a$ rich paths. Furthermore, two distinct rich paths (either the inputs
queried or outputs produced are different) are associated with disjoint sets of input n-tuples.
Thus, $2^S |\mathcal{A}|^a |\mathcal{A}|^{n - a - \nu b}$ cannot be less than the number of input n-tuples
in question, from which the following inequality holds:

$$\phi |\mathcal{A}|^n \le 2^S |\mathcal{A}|^a |\mathcal{A}|^{n - a - \nu b}$$

We conclude that

$$S \ge \nu b \log_2 |\mathcal{A}| + \log_2 \phi$$

We replace $b = \lceil m/\sigma \rceil$ by its lower bound $ma/2T$. Since $\tau(b)$ is a
nondecreasing function, the value of a satisfying $a \le \tau(b)$ is not increased by replacing
b by $ma/2T$. Thus, $S \ge \nu (ma/2T) \log_2 |\mathcal{A}| + \log_2 \phi$, subject to
$a \le \tau(ma/2T)$ and $a \le \lambda n$.

We show there exists an integer $n_0$ such that for $n \ge n_0$ the condition
$b \le \lceil \mu m \rceil$ is met by the condition $\lambda \le \mu$. Note that
$b = \lceil m/\sigma \rceil$ is a nondecreasing function of a and a nonincreasing function of T
since $\sigma = \lceil (T+1)/(a+1) \rceil$ is a nonincreasing function of a and a nondecreasing
function of T. Thus, b is largest when $T = n$ and $a = \lambda n$. It follows that b is largest
when $\sigma = \lceil (n+1)/(\lambda n + 1) \rceil \le \lceil 1/\lambda \rceil$. If
$n > (\lceil 1/\lambda \rceil - 2)/(1 - \lambda(\lceil 1/\lambda \rceil - 1))$, then
$(n+1)/(\lambda n + 1) > \lceil 1/\lambda \rceil - 1$, which implies that
$\lceil (n+1)/(\lambda n + 1) \rceil = \lceil 1/\lambda \rceil$. In other words, when
$n > (\lceil 1/\lambda \rceil - 2)/(1 - \lambda(\lceil 1/\lambda \rceil - 1))$,
b assumes a value of at most $\lceil m/\lceil 1/\lambda \rceil \rceil \le \lceil \lambda m \rceil$. ■

COROLLARY 10.11.1 Let $f : \mathcal{A}^n \mapsto T^m$ be $(\phi, \lambda, \mu, \nu, \tau)$-distinguishable
for $\lambda \le \mu$ and $\tau(b) = n$. Then the space S and time T required by any normal-form
branching program $\mathcal{P}$ that computes f must satisfy

$$ST \ge \frac{m n \lambda \nu}{2} \log_2 |\mathcal{A}| + T \log_2 \phi$$

when $T \ge n$ and $n > (\lceil 1/\lambda \rceil - 2)/(1 - \lambda(\lceil 1/\lambda \rceil - 1))$.

Proof The result follows from the observation that the maximum value of a in
Theorem 10.11.1 is $\lambda n$. ■

The connection between $(\alpha, n, m, p)$-independence and
$(1, \lambda, \mu, \nu, \tau)$-distinguishability is given below.

LEMMA 10.11.1 If $f : \mathcal{A}^n \mapsto T^m$ is $(1, \lambda, \mu, \nu, \tau)$-distinguishable, it is
$(1/\nu, n, m, p)$-independent for $p = \min(\lambda n, \tau(\mu m)) + \mu m$.

Proof Consider sets of a input and b output variables to f such that $a \le \tau(b)$,
$a \le \lambda n$, and $b \le \mu m$, or equivalently $a \le \tau^*$, where
$\tau^* = \min(\lambda n, \tau(\mu m))$ since $\tau(x)$ is nondecreasing in x. For any particular
assignment to the a inputs, the input n-tuples that agree with this assignment but lead to
different values for the b outputs must be disjoint, as suggested in Fig. 10.19. We show that
for some assignment of values to the a inputs, the number of values assumed by the b outputs
is more than $|\mathcal{A}|^{b/\alpha - 1}$ for $\alpha = 1/\nu$. Suppose not. Then there are at most
$|\mathcal{A}|^{n - a - \nu b} |\mathcal{A}|^{\nu b - 1}$ input tuples for each assignment to the a inputs,
or a total of at most $|\mathcal{A}|^{n - 1}$ input tuples. Since f has $|\mathcal{A}|^n$ input tuples,
we have a contradiction. Therefore, f is $(1/\nu, n, m, p)$-independent for
$p = \tau^* + \mu m$. ■

The following lemma makes it easier to derive space-time lower bounds for branching 
programs. It uses the notions of subfunction (see Definition 2.4.2) and reduction (see Defini- 
tion 2.4.1). 




Figure 10.19 On the left are the points in the domain of f that map to individual output
b-tuples when the values of a input variables are fixed.

LEMMA 10.11.2 Let $g : \mathcal{A}^r \mapsto \mathcal{A}^s$ be a reduction of
$f : \mathcal{A}^n \mapsto \mathcal{A}^m$ that is either a subfunction or a reduction obtained by
restricting f to a subset of its domain. A lower bound to the space-time product ST on
branching programs for g is also a lower bound for f.

Proof Given any branching program for f, we can construct one for g that has no more
vertices or longer paths as follows. If g is obtained by deleting outputs, delete these outputs
from vertices in the branching program. This may allow the coalescing of vertices. If g is
obtained by restricting the set of values that variables of f can assume, this may make some
paths and subgraphs inaccessible and therefore removable. If g is obtained by giving two
variables of f the same identity, this constrains the branching program and again may make
some subgraphs inaccessible. In all cases neither the number of vertices nor the length of
any path to a sink vertex is increased by the reduction of f to g. Thus, any lower bound to
ST for g must be a lower bound for f. ■



10.12 Properties of "nice" and "ok" Matrices* 

In this section we develop properties of matrices that are $\gamma$-nice or $\gamma$-ok, concepts
we now introduce. (A matrix that is $\gamma$-nice is also $\gamma$-ok.) These properties are used in
Section 10.13 to develop lower bounds on the exchange of space for time using the Borodin-Cook
method. This section requires a knowledge of probability theory.

DEFINITION 10.12.1 An $n \times m$ matrix A, $n \le m$, is $\gamma$-nice for $0 < \gamma < 1/2$ if and
only if for all $p \le \lceil \gamma n \rceil$ and $q \ge n - \lceil \gamma n \rceil$ every
$p \times q$ submatrix of A has rank p. Such a matrix is $\gamma$-ok if all such $p \times q$
submatrices have rank at least $\gamma p$.

As shown below, most matrices are $\gamma$-nice, a fact that is used in several places.

LEMMA 10.12.1 At least a fraction $(1 - |\mathcal{A}|^{-1}(2/3)^{\gamma n})$ of the $|\mathcal{A}|^{n^2}$
$n \times n$ matrices over a subset $\mathcal{A}$ of a field, $|\mathcal{A}| \ge 2$, are $\gamma$-nice for some
constant $\gamma$, $0 < \gamma < 1/2$, independent of n and $\mathcal{A}$. This result also holds for
$n \times n$ Toeplitz matrices, matrices $[t_{i,j}]$ with the property that $t_{i,j} = a_{i-j}$;
that is, all elements on each diagonal are the same.

Proof Let $r = \lceil \gamma n \rceil$ and $s = n - r$. The proof is established by deriving
upper bounds on the number $N(r, s)$ of $r \times s$ submatrices of an $n \times n$ matrix M and
the probability $q(r, s)$ that any particular $r \times s$ submatrix fails to contain a
non-singular $r \times r$ submatrix (it fails to have rank r) when each entry in M is equally
likely to be an element of $\mathcal{A}$. Since the probability of a union of events is at most the
sum of the probabilities of the events, the probability that some $r \times s$ submatrix fails
to have rank r is at most $q(r, s)N(r, s)$.

It is straightforward to show that

$$N(r, s) = \binom{n}{r}^2$$

since an $r \times s$ submatrix of an $n \times n$ matrix is chosen by selecting a set of r rows
and a set of s columns and each can be chosen in $\binom{n}{r}$ ways. (Note that
$\binom{n}{s} = \binom{n}{r}$.) We now show that the binomial coefficient $\binom{n}{r}$ is at
most $(n/r)^r e^r$. We use the fact that $n!/(n-r)! = n(n-1) \cdots (n-r+1) \le n^r$ and the
observation that $r^r/r!$ is a term in the Taylor-series expansion of $e^r$, as stated below:

$$\binom{n}{r} = \frac{n!}{r!(n-r)!} \le \frac{n^r}{r!} = \left(\frac{n}{r}\right)^r \frac{r^r}{r!} \le \left(\frac{n}{r}\right)^r e^r$$
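The bound $\binom{n}{r} \le (n/r)^r e^r$ is easy to spot-check numerically. The Python sketch below is such a check over a small range of n and r; it is an illustration, not a substitute for the argument, and the helper name `binomial_upper_bound` is hypothetical.

```python
from math import comb, e


def binomial_upper_bound(n, r):
    """The bound (n/r)^r * e^r = (e*n/r)^r on the binomial coefficient C(n, r)."""
    return (e * n / r) ** r


# Compare the exact coefficient with the bound for 1 <= r <= n < 40.
holds = all(
    comb(n, r) <= binomial_upper_bound(n, r)
    for n in range(1, 40)
    for r in range(1, n + 1)
)
print(holds)
print(comb(20, 5), round(binomial_upper_bound(20, 5)))
```

The bound is loose by roughly the factor by which $e^r$ exceeds $r^r/r!$, which is harmless in the proof because only its exponential growth rate in r matters.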



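Later we show that $q(r, s) \le \rho^{-s} |\mathcal{A}|^{r-1}$, where
$\rho = |\mathcal{A}|^2/(2|\mathcal{A}| - 1) \le 2|\mathcal{A}|/3$, from which it follows that

$$q(r, s)N(r, s) \le \left(\frac{en}{r}\right)^{2r} \rho^{-n} \rho^{r} |\mathcal{A}|^{r-1} \le |\mathcal{A}|^{-1} \left[ \rho^{-1} \left(\frac{en|\mathcal{A}|}{r}\right)^{2r/n} \right]^{n}$$

since $s = n - r$. Elementary calculus shows that $(en|\mathcal{A}|/r)^{2r/n}$ is an increasing
function of r and that it has value 1 at $r = 0$. Since $r = \lceil \gamma n \rceil$ and
$\rho \ge 4/3$, it follows that the quantity in square brackets is less than 1 for some value
of $0 < \gamma < 1/2$, which is the desired conclusion.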

We now give a proof by induction that $q(r, s)$ satisfies
$q(r, s) \le \rho^{-s} |\mathcal{A}|^{r-1}$. Clearly $q(1, 1) \le 1/|\mathcal{A}|$, since at most one
entry in $\mathcal{A}$ is zero. This satisfies the bound. We now assume the inductive hypothesis
holds for $q(r-1, s-1)$ and $q(r, s-1)$ and show that it holds for $q(r, s)$.

Consider an $r \times s$ matrix B. It has rank r if the submatrix consisting of the first
$s - 1$ columns has rank r. (This occurs with probability $1 - q(r, s-1)$.) If this is not the
case, there are many other ways in which it can have rank r. In particular, this is true if the
submatrix C consisting of the last $r - 1$ rows and the first $s - 1$ columns of B has rank
$r - 1$ (with probability $1 - q(r-1, s-1)$) and the element $b_{1,s}$ has an appropriate value
(with probability at least $1 - 1/|\mathcal{A}|$), as we now show.

Consider a submatrix D consisting of some $r - 1$ linearly independent columns of C.
Consider the $r \times r$ submatrix of B consisting of these same $r - 1$ columns and its last
column. When the determinant of this matrix is expanded on the first row, the multiplier of
$b_{1,s}$ is $\pm 1$ times the determinant of D, which is non-zero. Thus, there is at most one
value for $b_{1,s}$ that causes the determinant to be zero (the field element causing it to be
zero may not be in the set $\mathcal{A}$) or at least $|\mathcal{A}| - 1$ values that cause it to be
non-zero. Summarizing this result, we have the following lower bound:

$$1 - q(r, s) \ge (1 - q(r, s-1)) \frac{1}{|\mathcal{A}|} + (1 - q(r-1, s-1)) \left(1 - \frac{1}{|\mathcal{A}|}\right)$$

This implies that

$$q(r, s) \le q(r, s-1) \frac{1}{|\mathcal{A}|} + q(r-1, s-1) \left(1 - \frac{1}{|\mathcal{A}|}\right) \le \rho^{-s} |\mathcal{A}|^{r-1}$$

which is the desired conclusion.

The proof also holds for Toeplitz matrices (each element on a diagonal of the matrix 
is the same) because we reasoned only about the value of elements in the upper right-hand 
corner of submatrices that are on different diagonals. ■ 
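The inductive step in the proof above can be verified directly. The worked step below assumes the recurrence $q(r,s) \le q(r,s-1)/|\mathcal{A}| + q(r-1,s-1)(1 - 1/|\mathcal{A}|)$ and the definition $\rho = |\mathcal{A}|^2/(2|\mathcal{A}| - 1)$ from the proof, and substitutes the inductive hypothesis for the two terms on the right:

```latex
\begin{align*}
q(r,s) &\le \frac{1}{|\mathcal{A}|}\,\rho^{-(s-1)}|\mathcal{A}|^{r-1}
          + \left(1 - \frac{1}{|\mathcal{A}|}\right)\rho^{-(s-1)}|\mathcal{A}|^{r-2} \\
       &= \rho^{-(s-1)}|\mathcal{A}|^{r-1}
          \left(\frac{1}{|\mathcal{A}|} + \frac{1}{|\mathcal{A}|} - \frac{1}{|\mathcal{A}|^{2}}\right) \\
       &= \rho^{-(s-1)}|\mathcal{A}|^{r-1}\,\frac{2|\mathcal{A}| - 1}{|\mathcal{A}|^{2}}
        = \rho^{-s}|\mathcal{A}|^{r-1}.
\end{align*}
```

The last equality uses $(2|\mathcal{A}| - 1)/|\mathcal{A}|^2 = \rho^{-1}$, which appears to be why $\rho$ is defined as it is.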

The Kronecker product of matrices is used in Section 10.13.5 to derive a lower bound on 
the space-time product for matrix inversion. 

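DEFINITION 10.12.2 The Kronecker product of two $n \times n$ matrices A and B is the
$n^2 \times n^2$ matrix C, denoted $C = A \otimes B$, obtained by replacing the entry $a_{i,j}$
of A with the matrix $a_{i,j}B$.

A Kronecker product $C = A \otimes B$ of matrices A and B is shown below:

$$A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}, \quad
B = \begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix}, \quad
C = \begin{bmatrix} 5 & 6 & 10 & 12 \\ 7 & 8 & 14 & 16 \\ 15 & 18 & 20 & 24 \\ 21 & 24 & 28 & 32 \end{bmatrix}$$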

The following property of the Kronecker product of two $\gamma$-nice matrices is used to derive
the space-time lower bounds stated in Theorem 10.13.5.

LEMMA 10.12.2 If A and B are both $n \times n$ $\gamma$-nice matrices for some $0 < \gamma < 1/2$,
then $C = A \otimes B$ is an $n^2 \times n^2$ $\gamma^2$-ok matrix.

Proof Number the rows and columns of $A$, $B$, and $C$ consecutively from 0. For a matrix
$E$, extend the notation $e_{i,j}$ for the entry in the $i$th row and $j$th column of $E$ to $e_{I,J}$, by
which we denote the submatrix of $E$ consisting of the intersection of the rows in the set $I$
and columns in the set $J$. Thus, if $I = \{i\}$ and $J = \{j\}$, then $e_{I,J} = e_{i,j}$.

To show that $C$ is $\gamma^2$-ok, we must show that every $p \times q$ submatrix $S$ of $C$ satisfying
$p \leq \lceil \gamma^2 n^2 \rceil$ and $q \geq n^2 - \lceil \gamma^2 n^2 \rceil$ has rank at least $\gamma^2 p$. Such a matrix $S$ can be represented
as $S = c_{I,J}$ for index sets $I$ and $J$, where $p = |I| \leq \lceil \gamma^2 n^2 \rceil$ and $q = |J| \geq n^2 - \lceil \gamma^2 n^2 \rceil$.
We assume that $\gamma n \geq 1$, since otherwise the result holds trivially.

The $r$th block row of $C$ is the submatrix $[a_{r,0}B, a_{r,1}B, \ldots, a_{r,n-1}B]$ containing rows
numbered $I_r = \{rn, rn+1, \ldots, rn+n-1\}$ and all $n^2$ columns.

Let $\Delta_r = I \cap \{rn, rn+1, \ldots, rn+n-1\}$ be the indices of the rows of $S$ that fall into
the $r$th block row. Choose a set $T \subseteq \{0, 1, 2, \ldots, n-1\}$ of size $|T| = \lceil \gamma n \rceil$ that maximizes
the sum $\tau = \sum_{r \in T} |\Delta_r|$. Then, $\tau \geq \gamma p$ because the lower bound is achieved if the rows
of $S$ are uniformly distributed over the rows of $C$ and $\tau$ is larger if they are not.

Let $\Lambda_r = \Delta_r$ if $|\Delta_r| \leq \lceil \gamma n \rceil$ and let $\Lambda_r$ consist of the smallest $\lceil \gamma n \rceil$ indices in $\Delta_r$
otherwise. Clearly, $|\Lambda_r| \geq \gamma |\Delta_r|$ because $\Delta_r$ is chosen from a set of size $n$. Call rows of $C$
with indices in $\bigcup_{r \in T} \Lambda_r$ blue rows. There must be at least $\gamma^2 p$ blue rows because, if not,

$$\gamma^2 p > \sum_{r \in T} |\Lambda_r| \geq \gamma \sum_{r \in T} |\Delta_r| = \gamma \tau \geq \gamma^2 p$$

which is a contradiction.

We now show that the blue rows of $S$ are linearly independent. Suppose not. Then
there exist constants $\{\alpha_{r,s} \mid r \in T, s \in \Lambda_r\}$, not all of which are zero, such that the linear
combination of the blue rows of $S$ is zero:

$$\sum_{r \in T} \sum_{s \in \Lambda_r} \alpha_{r,s}\, c_{nr+s,J} = \mathbf{0} \qquad (10.7)$$

Here $\mathbf{0}$ is a column vector of zeros, one per blue row. Again, $J$ is the set of columns of $C$ in
the submatrix $S$.

Column $j$ of the $n \times n$ matrix $B$ is good if it is associated with at least $(1 - \gamma)n$ columns
of $S$ and is bad otherwise. Let $G$ be the indices of the good columns in $B$ and let $g = |G|$.
Then there are $g \geq (1 - \gamma)n$ good columns and $b \leq \gamma n$ bad columns in $B$ ($g + b = n$)
because, if not, $g \leq (1 - \gamma)n - 1$ and the number of columns altogether in $S$ is at most
$gn + b(1 - \gamma)n$, which is an increasing function of $g$ whose value is less than $n^2 - \lceil \gamma^2 n^2 \rceil$
when $g \leq (1 - \gamma)n - 1$, which is less than the number of columns of $S$.

Since $B$ has at least $g = |G| \geq (1 - \gamma)n$ good columns and $B$ is $\gamma$-nice, any set of up to
$\lceil \gamma n \rceil$ rows are linearly independent. In particular, the rows of $B$ indexed by $\Lambda_r$ are linearly
independent. This implies that

$$\sum_{s \in \Lambda_r} \alpha_{r,s}\, b_{s,G} \neq \mathbf{0}$$

where $\mathbf{0}$ is a zero column with $|\Lambda_r|$ rows. Thus, there must be a column index $t \in G$ such
that

$$\sum_{s \in \Lambda_r} \alpha_{r,s}\, b_{s,t} \neq 0 \qquad (10.8)$$

Let $K = \{j \mid nj + t \in J\}$ be the columns of $S$ corresponding to the good column of $B$
with index $t$. It follows that $|K| \geq \lceil (1 - \gamma)n \rceil$.

Let $u_i = c_{i,J} \cap K$, the intersection of the $i$th row of $S$ with columns whose indices are in
$K$. Similarly, let $v_i$ be the intersection of the $i$th row of $A$ with columns in $K$. It follows
from the definition of $C$ that $u_{ni+j} = b_{j,t} v_i$. From (10.7) we have that

$$\sum_{r \in T,\, s \in \Lambda_r} \alpha_{r,s}\, c_{nr+s,\, J \cap K} = \sum_{r \in T} \Bigl( \sum_{s \in \Lambda_r} \alpha_{r,s}\, b_{s,t} \Bigr) v_r = \mathbf{0}$$

However, the $|T|$ rows $v_r$ constitute a $\lceil \gamma n \rceil \times |K|$ submatrix of the $\gamma$-nice matrix $A$
where $|K| \geq \lceil (1 - \gamma)n \rceil$. Since its rows are linearly independent, each of the coefficients
$\sum_{s \in \Lambda_r} \alpha_{r,s}\, b_{s,t}$ must be zero, contradicting the statement of (10.8). It follows that $C =
A \otimes B$ is $\gamma^2$-ok. ■



10.13 Applications of the Borodin-Cook Method 

In this section we illustrate the Borodin-Cook method of Section 10.11 by applying it to a 
variety of representative problems. 

10.13.1 Convolution

The wrapped convolution function $f_{\text{wrapped}}^{(n)} : \mathcal{R}^{2n} \mapsto \mathcal{R}^{n}$ over the ring $\mathcal{R}$ (see Problem 6.19)
of two sequences $u$ and $v$ is described by the matrix-vector product $Cv$ of a circulant matrix
$C$ in which $c_{i,j} = u_{(i-j) \bmod n}$, as shown in Section 10.5.1.
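The circulant formulation can be illustrated with a short sketch. It assumes, purely for illustration, that the ring is the integers modulo a small number `modulus`; the function names are hypothetical and not from the text.

```python
def circulant(u):
    """Circulant matrix C with C[i][j] = u[(i - j) mod n]."""
    n = len(u)
    return [[u[(i - j) % n] for j in range(n)] for i in range(n)]


def wrapped_convolution(u, v, modulus):
    """Wrapped convolution computed as the matrix-vector product C v."""
    n = len(u)
    c = circulant(u)
    return [sum(c[i][j] * v[j] for j in range(n)) % modulus for i in range(n)]


def wrapped_convolution_direct(u, v, modulus):
    """Same function computed from the definition: index sums wrap modulo n."""
    n = len(u)
    w = [0] * n
    for i in range(n):
        for j in range(n):
            w[(i + j) % n] = (w[(i + j) % n] + u[i] * v[j]) % modulus
    return w


u = [1, 2, 3, 4]
v = [1, 0, 1, 0]
print(wrapped_convolution(u, v, 5))
print(wrapped_convolution_direct(u, v, 5))
```

Both functions return the same vector, which is the sense in which the convolution "is described by" the product $Cv$.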

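LEMMA 10.13.1 For $n$ even, the wrapped convolution $f_{\text{wrapped}}^{(n)} : \mathcal{R}^{2n} \mapsto \mathcal{R}^{n}$ over the ring $\mathcal{R}$
contains a subfunction $g^{(n)} : \mathcal{R}^{2n} \mapsto \mathcal{R}^{n/2}$ that is $(1, \gamma/2, \gamma/2, 1, 2n)$-distinguishable for some
$0 < \gamma < 1/2$.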

Proof Writing $C$ as a $2 \times 2$ matrix of $n/2 \times n/2$ matrices, we find that its (1,1) entry is
an unrestricted Toeplitz matrix $T$. That is, each diagonal can contain a different element.
Consider the subfunction of $f_{\text{wrapped}}^{(n)}$ defined by this submatrix. By Lemma 10.12.1, a
fraction of at least 1 — (2/3) ( - 7 ' 2 - ) ™/|7?.| of such matrices are 7-nice. By Definition 10.12.1, 
this implies that $\lceil (\gamma/2)n \rceil$ output variables assume $|\mathcal{R}|^{\lceil (\gamma/2)n \rceil}$ different values. If we fix
the entries of $T$ to be those of a $\gamma$-nice matrix, by Lemma 10.11.2 the lower bound on $ST$
for matrix-vector multiplication with a Toeplitz matrix with $n$ replaced by $n/2$ serves as a
lower bound for the original problem. Since for large $n$ most Toeplitz matrices are $\gamma$-nice,
we have the desired conclusion. ■

Invoking Theorem 10.11.1, we have the space-time lower bound stated below. The upper
bound follows from the design of a branching program to implement the inner product
operation, as suggested by Fig. 10.6.

THEOREM 10.13.1 There is an integer $n_0 > 0$ such that for $n$ even and $n \geq n_0$, the time $T$ and
space $S$ used by any general branching program for the wrapped convolution $f_{\text{wrapped}}^{(n)} : \mathcal{R}^{2n} \mapsto
\mathcal{R}^{n}$ over the ring $\mathcal{R}$ must satisfy

$$ST = \Omega(n^2 \log |\mathcal{R}|) \qquad (10.9)$$

Branching programs exist that achieve the following bound for $\log |\mathcal{R}| \leq S \leq n \log |\mathcal{R}|$:

$$ST = O(n^2 \log n \log |\mathcal{R}|)$$

Proof Since the wrapped convolution function depends on $2n$ variables, it can be computed
via table lookup with space $O(n \log |\mathcal{R}|)$ and time $O(n)$.

At the limit of small space, namely for $S = O(\log |\mathcal{R}|)$, a branching program can
be designed that computes the $n$ inner products defined by the matrix-vector product of
(10.1). An example of a branching program to compute the inner product of two 3-vectors
is shown in Fig. 10.20. A branching program for the inner product of two $n$-tuples can be
constructed that has $O(n|\mathcal{R}|^2)$ vertices and depth $O(n)$. Hence, a branching program to
multiply a general $n \times n$ matrix by a vector can be constructed that has time $O(n^2)$ and
space $O(\log n + \log |\mathcal{R}|)$.

To fill in the range between these extremes, let $k$ divide $n$ and note that the product of
an $n \times n$ matrix by a column $n$-vector can be viewed as the product of an $n/k \times n/k$ matrix
of $k \times k$ matrices with a column $(n/k)$-vector of column $k$-vectors. Since each product of
a $k \times k$ submatrix by a $k$-vector is a function of $O(k)$ parameters, compute it with table
lookup in time $O(k)$ and space $O(k \log |\mathcal{R}|)$. Add two of these matrix-vector products by
rooting a table-lookup program at each of the $O(|\mathcal{R}|^{k})$ final states of a first table-lookup
program. Coalesce final states corresponding to the $|\mathcal{R}|^{k}$ sums of the two column $k$-vectors.
This program has $|\mathcal{R}|^{O(k)}$ vertices, or space $O(k \log |\mathcal{R}|)$, and time $O(k)$. Using $n/k$ such stages
increases the number of vertices and the time each by a factor of $n/k$. Since this process is
then repeated for each of the $n/k$ rows of the block matrix, the space and time used are
$O(k \log |\mathcal{R}| + \log(n/k))$ and $O(n^2/k)$, respectively. ■

Figure 10.20 A branching program to compute the inner product $c = a_1 b_1 + a_2 b_2 + a_3 b_3 \pmod 2$
of two 3-vectors over the set $\mathcal{R}$ of integers modulo 2.
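The small-space end of this tradeoff can be made concrete with a sketch of the inner-product branching program. It is an illustration under stated assumptions, not the program of Fig. 10.20: the ring is taken to be the integers modulo `q`, inputs are queried in the order $a_1, b_1, a_2, b_2, \ldots$, and the vertex count is one simple layered layout that fits the $O(n|\mathcal{R}|^2)$ bound. The function names are hypothetical.

```python
def inner_product_bp(a, b, q):
    """Follow one computation path of a layered branching program.

    Between coordinates the program remembers only the running sum
    (q possible states); after querying a[i] it also remembers a[i]
    (q * q possible states) until b[i] has been queried.
    """
    state = 0      # running sum modulo q
    queries = 0    # number of input variables read, i.e. the path length
    for a_i, b_i in zip(a, b):
        remembered = a_i % q          # vertex reached after querying a[i]
        queries += 1
        state = (state + remembered * (b_i % q)) % q
        queries += 1                  # vertex reached after querying b[i]
    return state, queries


def vertex_count(n, q):
    """Vertices in this layout: per coordinate, q vertices query a[i] and
    q * q vertices query b[i]; q sink vertices hold the final value."""
    return n * (q + q * q) + q


value, depth = inner_product_bp([1, 1, 0], [1, 1, 1], 2)
print(value, depth, vertex_count(3, 2))
```

The logarithm of the vertex count is $O(\log n + \log q)$, which matches the space bound quoted in the proof for a single inner product.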



10.13.2 Integer Multiplication 

To derive space-time lower bounds for integer multiplication, we could invoke the reductions 
from this problem to cyclic shifting, as was done in Section 10.5.3. However, as shown in 
Section 10.10, the space-time product for cyclic shifting is only O(nlogn). Thus, we are 
forced to use another reduction to obtain a strong space-time product lower bound, namely a 
reduction from integer multiplication to convolution. 

Let $\mathbb{Z}_2$ be the ring of integers modulo 2. As shown in Problem 6.20, the integer
multiplication function $f_{\text{mult}}^{(n)} : \mathcal{B}^{2n} \mapsto \mathcal{B}^{2n}$ contains the convolution function $f_{\text{conv}} :
\mathbb{Z}_2^{2n/\log n} \mapsto \mathbb{Z}_2^{2n/\log n}$. Thus, by Lemmas 10.11.2 and 10.13.1 the following holds:



THEOREM 10.13.2 There is an integer $n_0 > 0$ such that for $n \geq n_0$ the time $T$ and space $S$
used by any general branching program for binary integer multiplication $f_{\text{mult}}^{(n)} : \mathcal{B}^{2n} \mapsto \mathcal{B}^{2n}$ must
satisfy

$$ST = \Omega(n^2 / \log^2 n) \qquad (10.10)$$

This lower bound can be achieved to within a factor of $O(\log n)$ for space $\Omega(\log n) \leq S \leq
O(n)$.

Proof Since the integer multiplication function depends on $2n$ variables, it can be computed
via table lookup with space $O(n)$ and time $O(n)$, thereby meeting the lower bound
to within a factor of $O(\log n)$.

At the limit of small space, $S = O(\log n)$, the integer multiplication algorithm of
Section 10.5.3 provides a branching program. Since at most $\lceil \log_2 n \rceil$ bits suffice for the
carry from one power of 2 to the next, a branching program based on this algorithm has
at most $O(2^{\lceil \log_2 n \rceil})$ vertices at each of $n^2$ levels. Thus, this program uses time $O(n^2)$ and
space $O(\log n)$, achieving the lower bound to within a factor of $O(\log n)$.

We sketch a procedure to fill in the range of space between these extremes and ask the
reader to complete the details. (See Problem 10.39.) Assume that $k$ divides $n$ and represent
each $n$-bit binary number as an $(n/k)$-component base-$2^k$ number. As in the standard binary
integer multiplication algorithm (where $k = 1$), form $n/k$ $(n/k)$-component numbers
through multiplication and shifting of consecutive base-$2^k$ components, as suggested below:

$$\begin{array}{ccccccc}
 & & & v_3 u_0 & v_2 u_0 & v_1 u_0 & v_0 u_0 \\
 & & v_3 u_1 & v_2 u_1 & v_1 u_1 & v_0 u_1 & \\
 & v_3 u_2 & v_2 u_2 & v_1 u_2 & v_0 u_2 & & \\
v_3 u_3 & v_2 u_3 & v_1 u_3 & v_0 u_3 & & &
\end{array}$$

Here $u_r$ and $v_s$ are base-$2^k$ numbers. Multiply two such numbers through table lookup in
time and space $O(k)$. Extend the algorithm for the base-2 case by replacing each subprogram
that multiplies two binary numbers by the table lookup program to multiply base-$2^k$
numbers. This new program adds products to a running sum of length $O(\log n)$ bits. Thus,
it uses space $O(k + \log n)$ and time $O(n^2/k)$, giving a space-time product of $O(n^2 \log n)$
for $k \geq \log n$. ■
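The blocked scheme can be sketched in a few lines. This is an illustration only: the product of two $k$-bit components stands in for the table lookup described in the proof, Python integers stand in for the running sum, and the names `to_digits` and `multiply_in_blocks` are hypothetical.

```python
def to_digits(x, k, count):
    """Split x into `count` base-2**k components, least significant first."""
    mask = (1 << k) - 1
    return [(x >> (k * i)) & mask for i in range(count)]


def multiply_in_blocks(u, v, n, k):
    """Multiply two n-bit numbers using (n/k)**2 products of k-bit components.

    Each component product u_r * v_s is shifted by k*(r + s) bits, which
    reproduces the staggered rows of the standard multiplication layout.
    """
    assert n % k == 0, "k must divide n"
    m = n // k
    u_digits = to_digits(u, k, m)
    v_digits = to_digits(v, k, m)
    total = 0
    lookups = 0
    for r in range(m):
        for s in range(m):
            total += (u_digits[r] * v_digits[s]) << (k * (r + s))
            lookups += 1
    return total, lookups


product, lookups = multiply_in_blocks(183, 105, n=8, k=4)
print(product, lookups)
```

With $(n/k)^2$ component products, each costing $O(k)$ time by table lookup, the total time is $O(n^2/k)$, in line with the bound stated in the proof.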

10.13.3 Matrix-Vector Product

The matrix-vector product function $f_{A \times x}^{(n)} : \mathcal{R}^{n} \mapsto \mathcal{R}^{n}$ computes the $n$-tuple $y$ from the
$n$-tuple $x$ for a fixed $n \times n$ matrix $A$ over $\mathcal{R}$ according to the rule

$$y = Ax$$

where $y_j = \sum_{k=0}^{n-1} a_{j,k}\, x_k$ for $0 \leq j \leq n - 1$.

LEMMA 10.13.2 Let $A$ be a $\gamma$-ok $n \times n$ matrix over $\mathcal{R}$ for some $0 < \gamma < 1/2$. Then the matrix-vector
product function $f_{A \times x}^{(n)} : \mathcal{R}^{n} \mapsto \mathcal{R}^{n}$ is $(1, \gamma, \gamma, \gamma, \tau)$-distinguishable where $\tau(b) = n$.

Proof To show that $f_{A \times x}^{(n)}$ is $(1, \gamma, \gamma, \gamma, \tau)$-distinguishable, select any $a \leq \lceil \gamma n \rceil$ inputs and
any $b \leq \lceil \gamma n \rceil$ outputs. If the $i$th input is chosen and it has value $u_i$, introduce the equation
$x_i = u_i$. Let $B$ be the $a \times n$ coefficient matrix defining these equations; that is, $Bx = u$,
where $B$ contains the $j$th row of the $n \times n$ identity matrix if the $j$th variable is among the
selected inputs.

Consider the $(n + a) \times n$ matrix

$$C = \begin{bmatrix} A \\ B \end{bmatrix}$$

We show that it has rank $a + \gamma b$. The
submatrix $D$ of $A$ consisting of the intersection of those columns not selected by inputs (of
which there are $n - a \geq n - \lceil \gamma n \rceil$) and rows selected by outputs (of which there are $b$)
has rank $\gamma b$ because $A$ is $\gamma$-ok. Thus, $\gamma b$ of the $n - a$ columns of $A$ not selected by inputs
and the $a$ non-zero columns of $B$ are linearly independent. Thus, the submatrix $E$ of $C$
consisting of the selected rows of $B$ and the rows of $D$ has rank $a + \gamma b$.

The number of $n$-tuple input vectors $x$ consistent with the linear system $Ex = d$ is
$|\mathcal{A}|^{n-a-\gamma b}$, as we show. Without loss of generality assume that the first $a + \gamma b$ columns of $E$
(call it $F$) are linearly independent. (Permute the columns, if necessary, so that this is true.)
Fix the values of the $b$ realizable outputs. Then for each assignment to inputs corresponding
to the last $n - (a + \gamma b)$ columns there are unique values for the first $a + \gamma b$ inputs, due to
the non-singularity of $F$. Thus the number of assignments to the last $n - (a + \gamma b)$ columns
that are consistent with values for the $a$ inputs and $b$ outputs is $|\mathcal{A}|^{n-a-\gamma b}$. ■

Invoking Corollary 10.11.1 yields the following result.

THEOREM 10.13.3 Let $A$ be a $\gamma$-ok $n \times n$ matrix over $\mathcal{R}$ for some $0 < \gamma < 1/2$. Then there
is a constant $0 < \gamma < 1/2$ and an integer $n_0$ such that for $n \geq n_0$ the space $S$ and time $T$ used
by any general branching program for the function $f_{A \times x}^{(n)} : \mathcal{R}^{n} \mapsto \mathcal{R}^{n}$ must satisfy the following
lower bound when $T \geq n$:

$$ST = \Omega(n^2 \log |\mathcal{R}|)$$

This lower bound can be met to within a factor of $O(\log n)$ for $\log n \leq S \leq n$.

Proof The lower bound follows from the application of Theorem 10.11.1.

The matrix-vector product $Ax$ for an $n \times n$ matrix $A$ can be done with a branching
program for the standard algorithm as follows: Compute the inner product of the $i$th row
with the column $x$ for $1 \leq i \leq n$. The inner product of two $n$-tuples can be computed
with a branching program having $O(n|\mathcal{R}|^2)$ vertices, as suggested in Fig. 10.20. (This is
true even if $A$ is not fixed.) The $n$ branching programs for inner products can be concatenated to
form one branching program to multiply an $n \times n$ matrix with an $n$-vector. This branching
program uses space $O(\log n + \log |\mathcal{R}|)$ and time $O(n^2)$, thereby meeting the lower bound
to within a factor of $O(\log n)$.

A matrix-vector product for a fixed matrix (this case) can also be computed by table
lookup in space $O(n \log |\mathcal{R}|)$ and time $O(n)$ since this function has $n$ variables.

To bridge the gap between these two results, compute the matrix-vector product using a
hybrid algorithm similar to that used for convolution in the proof of Theorem 10.13.1. ■

10.13.4 Matrix Multiplication*

The space-time lower-bound argument for matrix multiplication in the branching program
model uses ideas similar to those used for matrix-vector multiplication.

LEMMA 10.13.3 The matrix multiplication function $f_{A \times B}^{(n)} : \mathcal{R}^{2n^2} \mapsto \mathcal{R}^{n^2}$ over the ring $\mathcal{R}$ is
$(1, 1, 1, \gamma/4, \tau)$-distinguishable for some $0 < \gamma < 1/2$, where $\tau(b) = \gamma n \sqrt{b/2}$.



Proof Consider the subfunction of $f_{A \times B}^{(n)}$ obtained by choosing $A$ and $B$ from the set of
$n \times n$ $\gamma$-nice matrices. By Lemma 10.11.2, a lower bound on the space-time product for
this subfunction provides a lower bound to the matrix multiplication function.

Consider some $a \leq 2n^2$ selected inputs and some $b \leq n^2$ selected outputs such that
$a \leq \tau(b)$; that is, $(a/\gamma n)^2 \leq b/2$. The outputs correspond to entries of the product matrix
$C = A \times B$. Let row $i$ of $C$ be a heavy row if at least $\gamma n$ of the $a$ selected inputs are in
row $i$ of $A$. Similarly, let column $j$ of $C$ be a heavy column if at least $\gamma n$ of the $a$ selected
inputs are in column $j$ of $B$. A row or column of $C$ is light otherwise. (See Fig. 10.21.)

There are at most $a/\gamma n$ heavy rows and $a/\gamma n$ heavy columns of $C$. We now show that
either a) at least $b/4$ of the selected outputs fall into light rows of $C$ or b) at least $b/4$ of
the selected outputs fall into light columns of $C$. Suppose not. Then both statements are
false and less than $b/4$ of the selected outputs fall into light rows and less than $b/4$ of the
selected outputs fall into light columns of $C$. It follows that at least $3b/4$ of the selected
outputs fall into heavy rows. Of these at most $(a/\gamma n)^2$ fall into heavy columns, since this is
the maximum number of entries of $C$ that could be in both heavy rows and columns. The
remaining selected outputs in these rows (of which there are less than $b/4$) fall into light
columns. However, because the entries in each row fall into either heavy or light columns,
the number of selected outputs that are in heavy rows is less than $(a/\gamma n)^2 + b/4$. But this
is at most $3b/4$ since $a \leq \tau(b) = \gamma n \sqrt{b/2}$, contradicting the stated hypothesis.

Without loss of generality, assume that b) holds. (If not, a) holds and at least $b/4$ selected
outputs fall into light rows of $C$ or into light columns of the transpose $C^T$.) Represent the
product $C = A \times B$ as follows:

$$\begin{bmatrix} A & & \\ & \ddots & \\ & & A \end{bmatrix} \begin{bmatrix} B^1 \\ \vdots \\ B^n \end{bmatrix} = \begin{bmatrix} C^1 \\ \vdots \\ C^n \end{bmatrix}$$

Here $B^i$ and $C^i$ are the $i$th columns of the matrices $B$ and $C$, respectively. Let $\widehat{B}$ and
$\widehat{C}$ denote the columns of these columns, respectively, and let $D$ denote the block diagonal
matrix on the left.

Figure 10.21 Identification of heavy rows and columns of matrices.

We show that at most $|\mathcal{R}|^{2n^2 - a - \gamma b/4}$ of the matrix pairs $(A, B)$ are consistent with any
assignment to any set of $a$ selected inputs and values of any $b$ selected outputs.

Of the $a$ selected inputs, let $a_1$ be drawn from $A$ and $a_2$ be drawn from $B$, where
$a = a_1 + a_2$. The number of $\gamma$-nice matrices $A$ consistent with the $a_1$ selected inputs from
$A$ is at most $|\mathcal{R}|^{n^2 - a_1}$. We now bound the number of matrices $B$ that are consistent with
the values of selected inputs and outputs.

Let $A$ be fixed and $\gamma$-nice. Consider just the (at least $b/4$) selected outputs that fall into
light columns of $C$. Every value for $B$ consistent with the selected inputs and these outputs
must satisfy the following linear equation:

$$\begin{bmatrix} E \\ F \end{bmatrix} \widehat{B} = H \widehat{B} = \begin{bmatrix} r \\ c \end{bmatrix}$$

Here $E$ consists of the $b$ rows of $D$ corresponding to selected outputs and $F$ is a submatrix
of the $n^2 \times n^2$ identity matrix consisting of the $a_2$ rows corresponding to selected inputs
in $B$. $c$ is the column of values for the selected inputs in $B$ and $r$ is a column of selected
outputs of $C$ that fall into light columns. The number of values for $B$ consistent with a
fixed $A$ and the values of the selected inputs and outputs is no more than the number of
solutions $\widehat{B}$ to these equations, since we are ignoring outputs in heavy rows.

We now show that $H$ has rank at least $a_2 + \gamma b/4$. A column of $H$ is queried if a column
of $E$ contains a selected input or the corresponding row of $\widehat{B}$ contains a selected input. $a_2$
of these columns correspond to selected inputs in $B$ and are linearly independent because
the corresponding columns of $F$ are linearly independent. Consider the unqueried columns
of $H$. These columns in $F$ are zero columns. Thus, consider these unqueried columns in
$E$. Consider $k$ rows in $E$ that come from a common copy of $A$ on the diagonal of $D$. The
column $B^i$ of $B$ corresponding to this copy of $A$ is light (it has fewer than $\gamma n$ selected
entries) because the corresponding column of $C$ is chosen to be light. Thus, this copy of $A$
has at least $n(1 - \gamma)$ unqueried entries, or at least $n(1 - \gamma)$ of its columns are unqueried.

Since $A$ is $\gamma$-nice, the unqueried columns of this copy of it have rank at least $\min(k, \gamma n)$.
Because there are no dependencies between columns in distinct copies of $A$ in $D$, the number
of linearly independent unqueried columns of $E$ is minimal if they all fall in as few
common copies of $A$ as possible, because then $\min(k, \gamma n) = \gamma n$. It follows that the
unqueried columns of $E$ have rank at least $\gamma b/4$. Since the queried columns have rank at least
$a_2$, the columns of $H$ have rank at least $a_2 + \gamma b/4$. It follows from an argument given
in the proof of Lemma 10.13.2 that the number of solutions $\widehat{B}$ to this system is at most
$|\mathcal{R}|^{n^2 - a_2 - \gamma b/4}$. Since there are at most $|\mathcal{R}|^{n^2 - a_1}$ matrices $A$ that are $\gamma$-nice and consistent
with the $a_1$ selected inputs in $A$, it follows that the number of pairs consistent with values
of the selected inputs and outputs is at most $|\mathcal{R}|^{2n^2 - a - \gamma b/4}$, the desired conclusion. ■

This result provides a lower bound on the space and time for matrix multiplication. The
upper bound cited below is obtained by another hybrid algorithm that mixes a branching
program for the standard algorithm with one for table lookup.

THEOREM 10.13.4 There is an integer $n_0 > 0$ such that for $n \geq n_0$ the space $S$ and time $T$
needed to compute the matrix multiplication function $f_{A \times B}^{(n)} : \mathcal{R}^{2n^2} \mapsto \mathcal{R}^{n^2}$ over the ring $\mathcal{R}$ using
a general branching program satisfies the inequality:

ST 2 >^n 6 log 2 \TZ\ 

for some $0 < \gamma < 1/2$ when $T \geq n^2$. This lower bound can be achieved up to a multiplicative
factor of $O(\log n)$ for space in the range $\Omega(\log n + \log |\mathcal{A}|) \leq S \leq O(n \log |\mathcal{A}|)$.

Proof The lower bound follows from Theorem 10.11.1 and Lemma 10.13.3 by letting
$a = \lfloor \gamma^2 n^4 / 4T \rfloor$, since this value of $a$ satisfies the two conditions $a \leq \tau(ma/2T) =
\gamma n \sqrt{ma/4T}$ (here $m = n^2$ is the number of outputs) and $a \leq 2n^2$ when $T \geq n^2$.

At the extreme of large space, namely $S = O(n^2)$, the upper bound follows from
a branching program for table lookup that has one level for each of the $2n^2$ variables in
the matrices $A$ and $B$ and the fact that there are $|\mathcal{R}|^{2n^2}$ pairs of such matrices over the
ring $\mathcal{R}$. Consequently, the branching program has at most $O(|\mathcal{R}|^{2n^2})$ vertices and space
$O(n^2 \log |\mathcal{R}|)$. It uses $O(n^2)$ steps.

At the extreme of small space, namely $S = \Omega(\log n + \log |\mathcal{A}|)$, we use a branching
program for the standard matrix multiplication algorithm that forms $n^2$ inner products of
rows and columns of the two matrices. As discussed in the proof of Theorem 10.13.3, a
branching program can be constructed to form the inner product of two $n$-tuples that has
$O(n|\mathcal{R}|^2)$ vertices; that is, space $\Omega(\log n + \log |\mathcal{A}|)$ and time $O(n)$. Concatenating $n^2$ of
these programs, one for each of the $n^2$ entries in the product matrix, we have a branching
program with space $\Omega(\log n + \log |\mathcal{A}|)$ and time $O(n^3)$.

To fill in the gap between these extremes, the method applied in Theorem 10.13.3 can
be used, as the reader can demonstrate. (See Problem 10.40.) ■

10.13.5 Matrix Inversion 

As an intermediate step to deriving a space-time product lower bound on matrix inversion, we 
derive a lower bound for the product of three n x n matrices. This is done by first deriving 
an alternate representation for this product in terms of the Kronecker product of two matrices. 
Kronecker products are defined in Section 10.12. 

LEMMA 10.13.4 Let $A$, $B$, $C$, and $D$ be $n \times n$ matrices over a commutative ring. The following
two equations define the same set of mappings from entries of $A$, $B$, and $C$ to entries in $D$:

$$D = ABC$$

$$E = (A \otimes C^T)\widehat{B}$$

where $\widehat{B}$ and $E$ are $n^2 \times 1$ column vectors obtained by concatenating the transposes of the rows of
the matrices $B$ and $D$, respectively.

Proof Let $E = (A \otimes C^T)\widehat{B}$. The goal is to show that the results in the $n^2 \times 1$ column
vector $E$ are the same as those in the $n \times n$ matrix $D$ but in a different order. In particular,
we show that the $(ni + j)$th entry in the former, namely $e_{ni+j,1}$, is equal to the $(i,j)$ entry in
$D$, namely $d_{i,j}$.

Given a matrix $F$, let $f_{i,j}$ denote its entry in the $i$th row and $j$th column. Let $f_{i,\cdot}$
and $f_{\cdot,j}$ denote the $i$th row and $j$th column of $F$, respectively. Let rows and columns of
matrices be numbered consecutively from zero.

The matrix $A \otimes C^T$ consists of blocks of $n$ consecutive rows with the $i$th block containing
$[a_{i,0}C^T, a_{i,1}C^T, \ldots, a_{i,n-1}C^T]$. Thus, the $(ni + j)$th entry of $E$, namely $e_{ni+j,1}$,
is the $j$th entry in the product $[a_{i,0}C^T, a_{i,1}C^T, \ldots, a_{i,n-1}C^T]\widehat{B}$, as shown below, where
$(c_{\cdot,j})^T (b_{k,\cdot})^T$ is the inner product of the row vector $(c_{\cdot,j})^T$ with the column vector
$(b_{k,\cdot})^T$.

$$\begin{aligned}
e_{ni+j,1} &= \sum_{k=0}^{n-1} a_{i,k}\, (c_{\cdot,j})^T (b_{k,\cdot})^T \\
&= \sum_{k=0}^{n-1} \sum_{l=0}^{n-1} a_{i,k}\, c_{l,j}\, b_{k,l} \\
&= \sum_{k=0}^{n-1} \sum_{l=0}^{n-1} a_{i,k}\, b_{k,l}\, c_{l,j} \\
&= d_{i,j}
\end{aligned}$$

This is the desired conclusion.
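The identity of Lemma 10.13.4 can be checked numerically on a small instance. The sketch below assumes integer matrices stored as lists of rows; all helper names (`matmul`, `transpose`, `kron`, `flatten_rows`) are illustrative and not from the text.

```python
def matmul(x, y):
    """Product of two n x n matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transpose(x):
    return [list(column) for column in zip(*x)]


def kron(a, b):
    """Kronecker product: entry (i*m + k, j*m + l) equals a[i][j] * b[k][l]."""
    n, m = len(a), len(b)
    return [[a[i][j] * b[k][l] for j in range(n) for l in range(m)]
            for i in range(n) for k in range(m)]


def flatten_rows(x):
    """Concatenate the rows of x into one vector (entry n*i + j is x[i][j])."""
    return [entry for row in x for entry in row]


A = [[1, 2, 0], [3, 1, 4], [0, 5, 2]]
B = [[2, 1, 1], [0, 3, 1], [4, 0, 2]]
C = [[1, 0, 2], [2, 1, 0], [3, 1, 1]]

D = matmul(matmul(A, B), C)
K = kron(A, transpose(C))
b_vec = flatten_rows(B)
E = [sum(K[row][col] * b_vec[col] for col in range(len(b_vec)))
     for row in range(len(K))]
print(E == flatten_rows(D))
```

Entry $ni + j$ of `E` equals $d_{i,j}$, which is exactly the reordering claimed in the proof.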



With this as background, we state the space-time results to compute the product of three
matrices.

THEOREM 10.13.5 There is an integer $n_0 > 0$ such that for $n \geq n_0$ the time $T$ and space
$S$ used by any general branching program to compute the product of three $n \times n$ matrices over a
commutative ring $\mathcal{R}$ must satisfy the following inequality:

$$ST = \Omega(n^4 \log |\mathcal{R}|)$$

Proof Given a general branching program to compute $ABC$, no more space or time is
used when the matrices $A$ and $C$ are given specific values. Let them each be $\gamma$-nice for
some $0 < \gamma < 1/2$. The existence of such matrices is established in Lemma 10.12.1.
From Lemma 10.12.2 we know that the matrix $A \otimes C^T$ is $\gamma^2$-ok. The result follows from
Theorem 10.13.3 since $A \otimes C^T$ is $n^2 \times n^2$. ■

We are now prepared to state space-time bounds for matrix inversion.

THEOREM 10.13.6 There is an integer $n_0 > 0$ such that for $n \geq n_0$ the time $T$ and space $S$
used by any general branching program to compute the inverse of a non-singular $n \times n$ matrix over
a commutative ring $\mathcal{R}$ must satisfy the following inequality:

$$ST = \Omega(n^4 \log |\mathcal{R}|)$$

This lower bound can be achieved to within a multiplicative factor over the range $\Omega(n^2) \leq T \leq
O(n^5)$.

Proof Let $n$ be a multiple of 4. The lower bound follows by reducing matrix inversion to
the computation of the product of three arbitrary $n/4 \times n/4$ matrices, as shown below:

The upper bound for $T = O(n^2)$ is obtained by table lookup using an algorithm of
the kind described in the proof of Theorem 10.13.3. For $T = O(n^5)$, the matrix inversion
algorithm based on the $LDL^T$ decomposition of a symmetric positive definite matrix of
Section 6.5.4 can be used. For intermediate values of time, a hybridized algorithm based on
the inversion of block matrices provides the stated upper bound. ■

10.13.6 Discrete Fourier Transform 

The discrete Fourier transform (DFT) and the fast Fourier transform algorithm are described 
in Sections 6.7.2 and 6.7.3. In this section we derive upper and lower bounds on space- 
time tradeoffs for this problem. The lower bound follows from the result for matrix-vector 
multiplication and the fact that the coefficient matrix for the DFT is (1/4) -ok. 

LEMMA 10.13.5 Consider the $n$-point DFT over a commutative ring that has a principal $n$th
root of unity $\omega$. It is defined as a matrix-vector product with $[\omega^{ij}]$ as its $n \times n$ coefficient matrix.
This matrix is (1/4)-ok.

Proof We use the fact, shown in Theorem 10.5.5, that the submatrix of $W = [\omega^{ij}]$ consisting
of any $k$ rows and any $k$ consecutive columns is non-singular. We show that any $p \times q$
submatrix $B$ of $W$, with $p \leq \lceil n/4 \rceil$ and $q \geq n - \lceil n/4 \rceil$, has rank at least $p/4$.

Let $I$ denote the row indices of the submatrix $B$ and let $J$ denote its column indices.
Let $C$ be the submatrix of $W$ with row indices in $I$. Divide the columns of $C$ into $\lceil n/p \rceil$
groups each containing $p$ columns except possibly the last which has at most $p$ columns. We
claim that some group has at least $p/2$ columns in common with $B$. Suppose not. Then
every one of the $\lceil n/p \rceil$ groups has at most $(p - 1)/2$ columns in common with $B$. Thus
$B$ has at most $\chi(p) = \lceil n/p \rceil (p - 1)/2$ columns. We show that $\chi(p) < n - (n + 3)/4 \leq
n - \lceil n/4 \rceil$. But this is a contradiction because $B$ has at least $n - \lceil n/4 \rceil$ columns. Since
$\lceil n/p \rceil \leq (n + p - 1)/p$, it suffices to show that $(n + p - 1)(p - 1)/2p < n - (n + 3)/4$.
Multiplying both sides by $2p$, this is equivalent to

$$(n + p - 1)(p - 1) < \frac{3p(n - 1)}{2} \quad \text{or} \quad -n + 1 < p\left(\frac{n + 1}{2} - p\right)$$

It suffices to show that the right-hand side of the last inequality is positive. But $((n+1)/2) - p$
is positive since $p \leq \lceil n/4 \rceil \leq (n + 3)/4 < (n + 1)/2$ for $n > 1$. ■
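The counting inequality $\chi(p) < n - \lceil n/4 \rceil$ used in this proof can be spot-checked by brute force. The sketch below is only a sanity check over a finite range, not a proof; it works with $2\chi(p)$ to stay in integer arithmetic, and the function names are hypothetical.

```python
def ceil_div(a, b):
    """Ceiling of a / b for positive integers."""
    return -(-a // b)


def claim_holds(n):
    """Check chi(p) = ceil(n/p) * (p - 1) / 2 < n - ceil(n/4)
    for every p with 1 <= p <= ceil(n/4)."""
    bound = n - ceil_div(n, 4)
    for p in range(1, ceil_div(n, 4) + 1):
        twice_chi = ceil_div(n, p) * (p - 1)   # equals 2 * chi(p)
        if not twice_chi < 2 * bound:
            return False
    return True


print(all(claim_holds(n) for n in range(2, 500)))
```

The check starts at $n = 2$ because the proof requires $n > 1$.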

THEOREM 10.13.7 There is an integer $n_0 > 0$ such that for $n \geq n_0$ the $n$-point DFT over a
commutative ring $\mathcal{R}$ requires space $S$ and time $T$ with a branching program satisfying the following
lower bound:

$$ST = \Omega(n^2 \log |\mathcal{R}|)$$

This lower bound can be achieved to within a constant multiplicative factor.

Proof The upper bound follows by applying Lemma 10.9.3 and Theorem 10.5.5. ■

10.13.7 Unique Elements 

We now derive a lower bound on the space-time product for the sorting problem by reducing 
sorting to the unique-elements problem. The unique elements problem takes a list of values 
and returns in any order a list of the non-repeated elements among them. 

DEFINITION 10.13.1 Let $\mathcal{R}$ be a set with at least $n$ distinct elements. The function $f_{\text{unique}}^{(n)} :
\mathcal{R}^{n} \mapsto 2^{\mathcal{R}^{n}}$ defines the unique elements problem where $2^{\mathcal{R}^{n}}$ is the power set of $\mathcal{R}^{n}$ and
$f_{\text{unique}}^{(n)}(x)$ is the set of non-repeated elements in the input string $x$.

We emphasize that no order is imposed on the outputs of $f_{\text{unique}}^{(n)}$. Thus, if a set of values
appears in the output, their position in the output does not matter.

From Lemma 10.11.2 it follows that a lower bound to $ST$ can be derived by restricting
the domain and discarding outputs. We restrict the domain by restricting each input variable
to values in a subset $\mathcal{S} \subseteq \mathcal{R}$ containing $n$ elements. We also restrict input tuples to the
set $\mathcal{D}$ of tuples containing at least $n/(2e)$ unique values ($e$ is the base of the natural logarithm). In
the following lemma we show that $|\mathcal{D}| \geq |\mathcal{S}|^{n}/(2e - 1) = \phi n^{n}$, where $\phi = 1/(2e - 1)$.
On inputs in $\mathcal{D}$ the function $f_{\text{unique}}^{(n)}$ has at least $n/(2e)$ unique outputs. We define the
subfunction $f_{\text{restricted}}^{(n)} : \mathcal{S}^{n} \mapsto \mathcal{S}^{m}$, $m = n/(2e)$, of $f_{\text{unique}}^{(n)}$ to be the subfunction obtained
by restricting its inputs to $\mathcal{D} \subseteq \mathcal{S}^{n}$ and deleting all but the first $n/(2e)$ outputs, which are all
unique.

LEMMA 10.13.6 Let $\mathcal{S}$ be a set of $n$ elements. The fraction $\phi$ of the input $n$-tuples over $\mathcal{S}^{n}$
containing $n/(2e)$ or more unique elements exceeds $1/(2e - 1)$.

Proof We use simple probabilistic arguments. Assign each $n$-tuple over $\mathcal{S}^{n}$ probability
$1/n^{n}$. Let $u(x)$ be the number of unique elements in $x$. Let $X_i(x)$ have value 1 if the $i$th
element of $\mathcal{S}$ occurs uniquely in $x$ and value 0 otherwise. Then

$$u(x) = \sum_{i=1}^{n} X_i(x)$$

Let $E[u]$ denote the average value of $u(x)$ (the sum of $u(x)$ over $x$ weighted by its
probability). Because the order of summation can be changed without affecting the sum, we
have

$$E[u(x)] = \sum_{i=1}^{n} E[X_i(x)]$$

$E[X_i(x)]$ is also the probability that $X_i = 1$. If $X_i = 1$, then each of the other components
of $x$ can assume only one of $n - 1$ values. Since the $i$th value can be in any one of $n$ positions
among input variables and since for each position that it occupies there are $(n - 1)^{n-1}$
ways to fill the remaining $n - 1$ positions so that the $i$th value is unique, we have that
$E[X_i] = f(n)$ where $f(n) = n(n - 1)^{n-1}/n^{n} = (1 - 1/n)^{n}/(1 - 1/n)$. But $f(n)$ is
a decreasing function of $n$, as is shown by calculating its derivative and using the inequality
$(1 - x) \leq e^{-x}$ (see Problem 10.5). The limit of $f(n)$ for large $n$ is $e^{-1}$ because in the limit
of small $x$ the function $e^{-x}$ has value $1 - x$. It follows that $E[u(x)] \geq n/e$.

Let $\pi = \Pr[u(x) \geq n/(2e)]$ be the fraction (or probability) of the input $n$-tuples
for which $u(x) \geq n/(2e)$. Because $u(x) \leq n$, it follows that $\pi n + (1 - \pi)n/(2e) \geq
E[u(x)] \geq n/e$, from which we conclude that $\pi \geq 1/(2e - 1)$. (This is known as Markov's
inequality.) ■
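For small $n$ the quantities in this proof can be computed exactly by enumerating all $n^n$ tuples. The sketch below is an illustration only: it takes $\mathcal{S} = \{0, \ldots, n-1\}$ as an assumed concrete set, and the function names are hypothetical.

```python
from fractions import Fraction
from itertools import product
from math import e


def unique_count(x):
    """Number of values that occur exactly once in the tuple x."""
    counts = {}
    for value in x:
        counts[value] = counts.get(value, 0) + 1
    return sum(1 for c in counts.values() if c == 1)


def statistics(n):
    """Return (E[u], fraction of n-tuples with u(x) >= n/(2e)) exactly."""
    total = n ** n
    threshold = n / (2 * e)
    sum_u = 0
    good = 0
    for x in product(range(n), repeat=n):
        u = unique_count(x)
        sum_u += u
        if u >= threshold:
            good += 1
    return Fraction(sum_u, total), Fraction(good, total)


mean_u, fraction_good = statistics(4)
print(mean_u, fraction_good)
```

For $n = 4$ the mean agrees with $n(1 - 1/n)^{n-1}$ from the proof, and the fraction of good tuples is well above the guaranteed $1/(2e - 1) \approx 0.225$.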

LEMMA 1 0. 1 3.7 Let \S\ = n. Then f^ tlicted : 5" h-> S"», m = n/(2e), is (<j>, A, ^ »/, r)- 
distinguishable for (f> = l/(2e — 1), A = /X = 1, f = (1 — l/(2e))/log 2 n, andrib) = n. 

Proof If /restricted ' s (^' ^' A*> ^> t) -distinguishable for = l/(2e — 1), A = \x = 1/2, 
;/ = (1 — l/(2e))/ log 2 n, and r(6) = n, then for at least 4>n n input tuples and any a < Xn 

input and b < [im output variables and specified values for them, / re "^ricted ^ as at most 
n n-a-vb _ n n-a e -(i-\ / (2e))b m p Ut j7,- tU pl es th at a re consistent with these assignments. 

The order of output values to / rcstr i ctG( j is irrelevant. 

Let $B$ be the values of the $b$ selected and specified unique outputs, $b \le m$, and let $A$
be the values of the $a$ selected and specified input values. The $k$ values in $B - A$ appear in
input positions that are not specified, and the remaining $r = n - k - a$ inputs are in neither
$A$ nor $B$. We overestimate the number of patterns of inputs consistent with the $a$ inputs and
$b$ outputs that are specified if we allow these $r$ inputs to assume any value not in $B$, since
all values in $B$ are unique. Thus, there are at most $(n-b)^r$ ways to assign values to these $r$
inputs. The $k$ values in $B - A$ are fixed, but their positions among the $r + k$ non-selected
inputs are not fixed. Since there are $(r+k)!/r!$ ways for these ordered $k$ values to appear among
any specific ordering of the remaining $r$ non-selected inputs (see Problem 10.6), the number $Q$
of input patterns consistent with the selected and specified $a$ inputs and $b$ outputs satisfies
the following inequality:

$$Q \le \frac{(r+k)!}{r!}\,(n-b)^r$$

Here $r + k = n - a \le n$ and $k \le b$. Below we bound $(r+k)!/r!$ by $(r+k)^k$ and use the
inequality $(1-x) \le e^{-x}$:

$$Q \le (r+k)^k (n-b)^r \le n^{r+k}\left(1 - \frac{a}{n}\right)^k \left(1 - \frac{b}{n}\right)^r$$

$$\le n^{n-a} e^{-(ka/n + rb/n)} \le n^{n-a-(ka/n+(n-a-k)b/n)}$$

The exponent $e(a, b, k) = ka/n + (n-a-k)b/n$ is a decreasing function of $a$ whose
smallest value is $(1 - k/n)b$. In turn, this function is a decreasing function of $k$ whose
smallest value is $(1 - b/n)b \ge (1 - 1/(2e))b$. As a consequence, we have



$$Q \le n^{n-a} e^{-(1-1/(2e))b}$$

It follows that $f^{(n)}_{\mathrm{restricted}}$ is $(\phi, \lambda, \mu, \nu, \tau)$-distinguishable for $\phi = 1/(2e-1)$, $\lambda = \mu = 1$,
$\nu = (1 - 1/(2e))/\log_2 n$, and $\tau(b) = n$.

b := 0;
for j := 1 to ⌈n/S⌉
    {b = (j - 1)S on the jth iteration.}
    begin
        for i := 1 to S
            C[i] := 0;
        for i := 1 to n
            if b < x_i ≤ b + S then
                begin
                    k := x_i - b;
                    if C[k] < 2 then C[k] := C[k] + 1;
                end;
        for i := 1 to S
            if C[i] = 1 then print b + i;
        b := b + S;
    end



Figure 10.22 A RAM program for the unique-elements problem over the set $\{1, 2, \ldots, n\}$
when $n \ge S \ge O(\log n)$. The input to the program is the $n$-tuple $x$ in which $x_i$ is the $i$th
entry. The program uses space $O(S)$.
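
As an illustration, the RAM program of Fig. 10.22 can be transcribed into Python roughly as follows. This is a sketch rather than part of the original figure: the function name `unique_elements` is hypothetical, the outputs are collected in a list instead of being printed, and the input values are assumed to lie in $\{1, \ldots, n\}$ with $n$ the length of the input.

```python
from math import ceil


def unique_elements(x, S):
    """Multi-pass unique-elements scan over values in {1..n}, n = len(x).

    Each pass uses S saturating counters (values 0, 1, 2), mirroring the
    array C[1..S] of the RAM program.
    """
    n = len(x)
    out = []
    b = 0
    for _ in range(ceil(n / S)):       # pass j handles values b+1 .. b+S
        C = [0] * (S + 1)              # C[1..S]; index 0 is unused
        for xi in x:
            if b < xi <= b + S:
                k = xi - b
                if C[k] < 2:           # saturate at 2 ("more than one")
                    C[k] += 1
        for i in range(1, S + 1):
            if C[i] == 1:              # value b + i occurs exactly once
                out.append(b + i)
        b += S
    return out


print(unique_elements([3, 1, 3, 5, 2, 5], S=2))   # prints [1, 2]
```

Each of the $\lceil n/S \rceil$ passes scans all $n$ inputs, so the running time of this sketch is $O(n^2/S + n)$ while only $S$ two-bit counters are live at any time.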



Invoking Theorem 10.11.1, we have a quadratic space-time product lower bound. The 
RAM program for the unique elements problem given in Fig. 10.22 can be converted to a 
branching program to obtain an upper bound on the space-time product needed for this 
problem, as shown in Theorem 10.13.8. 

THEOREM 10.13.8 Let $|\mathcal{R}| \ge n$. There is an integer $n_0 > 0$ such that for $n \ge n_0$ and
$S = \Omega(\log n)$ the time $T$ and space $S$ used by any general branching program for the unique

elements function f^L ue '■ TV 1 '<—- > 2 must satisfy 

$$ST = \Omega(n^2)$$

This lower bound can be met to within a constant multiplicative factor for inputs drawn from the 
set {1, 2,3,..., n}. 

Proof The lower bound follows directly from Theorem 10.11.1. The upper bound follows
from an analysis of the branching program that results from conversion of the RAM program
in Fig. 10.22. The RAM program makes $\lceil n/S \rceil$ passes over the input data. On the $j$th pass
the program examines input values in the range $\{(j-1)S + 1, \ldots, jS\}$ and determines for each
value whether there are zero, one, or more than one instances of it in the input.

The program uses an $S$-element one-dimensional array $C[1..S]$ that it initializes to zero
at the beginning of each pass. If on the $j$th pass the $i$th input variable, $x_i$, is in the interval
$\{(j-1)S + 1, \ldots, jS\}$, the array element associated with it, namely $C[x_i - (j-1)S]$, is
incremented unless it already has value 2. At the end of the $j$th pass, if the array element
$C[i]$ has value 1, the program prints out the value $(j-1)S + i$, namely, the value of an input that
appears only once in the input.
The reader is asked to show that the program of Fig. 10.22 can be converted to a branching
program of space $O(S)$ and time $O(T)$. (See Problem 10.41.) ■

The program of Fig. 10.22 relies on the fact that input variables are drawn from the set
$\{1, 2, 3, \ldots, n\}$. If the set from which they are drawn is much larger, say $\{1, 2, 3, \ldots, n^c\}$,
$c > 1$, the outer loop is executed $O(n^c/S)$ times and its total running time is $O(n^c)$. Thus,
the program is not optimal in this case.

10.13.8 Sorting 

The sorting problem is described in Section 6.8. The general sorting problem is defined by
a function $f^{(n)}_{\mathrm{sort}} : \mathcal{R}^n \mapsto \mathcal{R}^n$ that rearranges the values of input variables so they are in
descending order. Given a branching program for sorting, we show below that a branching
program for the unique-elements problem can be obtained with a small additional amount of
space. As a consequence, the space-time product lower bound for unique elements applies to
the sorting problem. We also give a nearly matching upper bound.

THEOREM 10.13.9 Let $|\mathcal{R}| \ge n$. There is an integer $n_0 > 0$ such that for $n \ge n_0$ and
$S = \Omega(\log n)$ the time $T$ and space $S$ used by any general branching program for the sorting
function $f^{(n)}_{\mathrm{sort}} : \mathcal{R}^n \mapsto \mathcal{R}^n$ that reports its outputs in descending order must satisfy

$$ST = \Omega(n^2)$$

This lower bound can be met to within a constant multiplicative factor for inputs drawn from the 
set {1, 2,3,..., n}. 

Proof Given a branching program for $f^{(n)}_{\mathrm{sort}}$ that uses space $S$, we use it to construct a
branching program for $f^{(n)}_{\mathrm{unique}}$ that uses space $S + O(\log n) = O(S)$. Since $f^{(n)}_{\mathrm{unique}}$
requires space that is $\Omega(n^2/T)$, the same lower bound applies to sorting.

The branching program for $f^{(n)}_{\mathrm{sort}}$ generates its sorted outputs in descending order. By
analyzing the outputs the unique elements can be found. Store the last output $\ell$ along with
a bit $b$ that is 1 if $\ell$ is so far the only occurrence of this value and 0 otherwise. If the next
output value is the same as $\ell$, set $b$ to 0. If it is different from $\ell$, produce $\ell$ as an
output when $b = 1$ and produce no output otherwise; in either case replace $\ell$ with the new
output and set $b$ to 1.
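
The bookkeeping described in this paragraph amounts to a single scan of the sorted output with constant extra state. The Python sketch below illustrates it under stated assumptions: the generator name `unique_from_sorted` is hypothetical, a `first` flag stands in for "no value stored yet", and the final flush of the stored value at the end of the stream is an added step that the text leaves implicit.

```python
def unique_from_sorted(sorted_values):
    """Yield the values occurring exactly once in a sorted sequence."""
    last = None        # the stored output (called l in the text)
    b = 0              # 1 iff `last` has been seen exactly once so far
    first = True       # True until the first value has been stored
    for v in sorted_values:
        if not first and v == last:
            b = 0                      # repeated value: no longer unique
        else:
            if not first and b == 1:
                yield last             # previous value occurred once
            last, b = v, 1             # start tracking the new value
        first = False
    if not first and b == 1:
        yield last                     # flush the final stored value


print(list(unique_from_sorted([5, 5, 4, 3, 3, 1])))   # prints [4, 1]
```

Only `last`, `b`, and the flag are kept between outputs, which corresponds to the additive $O(\log n)$ space term in the proof when values come from a set of size polynomial in $n$.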

Given a branching program $\Pi$ for sorting, we describe a branching program for unique
elements that uses modified copies of $\Pi$. If more than one output appears on some edge
in $\Pi$, modify it (yielding $\Pi^*$) by replacing edges producing more than one output by a
sequence of edges each producing one output separated by vertices testing an arbitrary
input. This increases the number of vertices in $\Pi$ by a factor of at most $n$ and adds at most
$\log_2 n$ to its space. Now make $2|\mathcal{R}|$ additional copies of $\Pi^*$, two for each value in $\mathcal{R}$, a
"one" copy if the value is the first encountered in the sorted output and a "zero" copy if it
is not.

Consider an edge in $\Pi^*$ or one of its copies that produces an output (call it $v$). There
are several cases to examine: the current copy of $\Pi^*$ is a) the original copy, b) a "one" copy,
or c) a "zero" copy. In case a), redirect the edge to the same vertex in the "one" copy of $\Pi^*$
associated with $v$. In case b), if $v$ is different from the value $c$ associated with the current
copy of $\Pi^*$, output $c$ and redirect the edge to the same vertex in the "one" copy of $\Pi^*$
associated with $v$. In case c), if $v$ is the same as the value associated with the current copy of
$\Pi^*$, produce no output; otherwise also produce no output but redirect the edge to the same
vertex in the "one" copy of $\Pi^*$ associated with $v$. The new branching program has at most
$2n + 1$ copies of $\Pi^*$, thereby increasing its space by an additive term of size $O(\log n)$. The
lower bound on $ST$ for the sorting problem follows.

The upper bound on $ST$ for the sorting problem is obtained by constructing a family of
branching programs, one for each value of $S$. We begin by constructing a "full" branching
program for the case $S = O(n)$. Let the variables in the input string be $x_1, x_2, \ldots, x_n$ and
let them be tested in sequence. Thus, the root is labeled $x_1$ and has $n$ successors, each of
which tests $x_2$. There is one successor for each vertex labeled with $x_2$ for each way two
numbers can be chosen with replacement from the set $\{1, 2, \ldots, n\}$. As shown in Problem 10.7,
there are $N(n, k)$ ways in which $k$ numbers can be drawn from a set of $n$ elements with
replacement where the order among the numbers is unimportant and

$$N(n, k) = \binom{n+k-1}{k}$$

Thus, $N(n, 1) = n$ and $N(n, 2) = (n+1)n/2$. The successors to vertices labeled $x_2$ are
labeled $x_3$. They have $N(n, 3)$ successors, and so on. At the $k$th level there are $N(n, k)$
successors. Since $N(n, k) \le 2^{n+k-1}$, it follows that for $k \le n$ the above branching program
has $O(2^{2n})$ vertices or space $S = O(n)$. It also has time $T = n$ and space-time product
$O(n^2)$.
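
The counting formula and the bound used above can be sanity-checked by enumeration for small parameters. The Python sketch below is illustrative only; the function name `N` simply mirrors the notation of the text, and `math.comb` requires Python 3.8 or later.

```python
from itertools import combinations_with_replacement
from math import comb


def N(n, k):
    """Number of ways to choose k numbers with repetition from n, order ignored."""
    return comb(n + k - 1, k)


for n in range(1, 7):
    for k in range(0, 7):
        # Enumerate the multisets directly and compare with the formula.
        enumerated = sum(1 for _ in combinations_with_replacement(range(n), k))
        assert enumerated == N(n, k)
        assert N(n, k) <= 2 ** (n + k - 1)

print(N(4, 1), N(4, 2))   # prints 4 10
```

Summing $N(n, k)$ over the levels $k \le n$ gives $\binom{2n}{n} \le 2^{2n}$, which is consistent with the $O(2^{2n})$ vertex count quoted in the proof.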

To construct a branching program for space $S = O(n)$, we use $O(n/S)$ pruned copies
of the full branching program described above. The idea behind the pruning is the
following: we scan the input list looking for variables with values in the set $\{1, 2, \ldots, S\}$. If
there are $O(S)$ of them, we record the number of values of each type and produce them in
sorted order. However, if there are more than $O(S)$ elements in this range, as we examine
additional inputs we reduce the size of the range so that only $O(S)$ space is used to carry
the number of values of variables encountered. (This space is represented by $2^{O(S)}$ vertices
in the branching program.) On each pass through the input either we reduce the size of
the range by $O(S)$ or reduce the number of outputs that must be produced by the same
amount. Thus, after $2n/S$ passes the input is sorted. Since each pass tests the value of each
variable, the time is $O(n^2/S)$.

It is not difficult to convert the above schema into a branching program. The goal is to
have no more than about $2^S$ vertices on each level of the branching program. The branching
program will consist of $O(n/S)$ copies of the full branching program, each having $n$ levels.
Thus, the branching program will have 0{n 2/5) vertices or space O(S). 

We order vertices at each level in the branching program, placing those with smaller
input values to the left. We remove vertices at the $j$th level that correspond to input values
larger than $S$ as well as those to the right of the first $2^S$ vertices on the $j$th level. Each edge
in the first full branching program that is directed into a removed vertex is redirected to the
root of the next copy of the branching program. The second copy of the full branching
program is pruned to remove the vertices appearing in the first copy as well as those reached
on inputs outside the range $\{S+1, S+2, \ldots, 2S\}$. The edges directed to removed vertices
are redirected to the root of the third copy of the full branching program. A similar process
is applied to each copy of the full branching program. ■
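
The pass structure behind the upper bound can be illustrated with a deliberately simplified Python sketch. It is not the construction of the proof: it uses fixed windows of width $S$ and ordinary integer counters instead of the range-shrinking rule, so each counter may need $O(\log n)$ bits. The function name `multipass_sort_desc` is hypothetical, and the inputs are assumed to lie in $\{1, \ldots, n\}$.

```python
from math import ceil


def multipass_sort_desc(x, S):
    """Sort values from {1..n}, n = len(x), into descending order.

    Each pass counts only the values in one window of width S, so
    ceil(n / S) passes over the input are made.
    """
    n = len(x)
    out = []
    passes = ceil(n / S)
    for j in range(passes, 0, -1):          # highest window first
        lo, hi = (j - 1) * S, j * S         # window holds values lo+1 .. hi
        count = [0] * (S + 1)               # count[1..S]; index 0 unused
        for xi in x:                        # one full scan per pass
            if lo < xi <= hi:
                count[xi - lo] += 1
        for i in range(S, 0, -1):           # emit the window, largest first
            out.extend([lo + i] * count[i])
    return out


print(multipass_sort_desc([3, 1, 3, 5, 2, 5], S=2))   # prints [5, 5, 3, 3, 2, 1]
```

As in the schema of the proof, the number of passes is proportional to $n/S$ and each pass reads every input, giving time proportional to $n^2/S$.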
Problems

MATHEMATICAL PRELIMINARIES 

10.1 Show that the pyramid graph on $m$ inputs, $P(m)$, has $m(m+1)/2$ vertices. Let
$n = m(m+1)/2$. Show that $m \ge \sqrt{2n} - 1$.

10.2 Show that the following inequalities hold for integers $m$ and $x$:

$$m/x \le \lceil m/x \rceil \le (m + x - 1)/x$$
$$(m - x + 1)/x \le \lfloor m/x \rfloor \le m/x$$

10.3 Suppose that $p \log_2 p \le q$ for positive integers $p, q \ge 2$. Show that $p \le 2q/\log_2 q$.

10.4 For $n$ positive integers $x_1, x_2, \ldots, x_n$, show that the following inequality holds between
the geometric mean on the left and the arithmetic mean on the right:

$$(x_1 x_2 \cdots x_n)^{1/n} \le (x_1 + x_2 + \cdots + x_n)/n$$

10.5 Show that the inequality $(1 - x) \le e^{-x}$ holds for $x \le 1$.

10.6 Show that there are $(r+k)!/r!$ ways for $k$ ordered values to appear among $r$ distinct
ordered items.

10.7 Show that there are $N(n, k) = \binom{n+k-1}{k} \le 2^{n+k-1}$ ways to choose with repetition $k$
numbers from a set $A$ of size $n$ where the order among the numbers is unimportant.
Choosing with repetition means that a number can be chosen more than once.

Hint: Without loss of generality, let $A = \{1, 2, \ldots, n\}$. Since order is unimportant,
assume the chosen numbers are sorted. Let each chosen number be represented by a
blue marker. Imagine placing the blue markers on a horizontal line. For $1 \le i \le n - 1$,
place a red marker between the last blue marker associated with the number $i$ and the
first blue marker associated with the number $i + 1$, if any. This representation uniquely
determines the number of elements of each type chosen. How many ways can the red
markers be placed?

10.8 Show that a complete balanced binary tree on $2^{k-1}$ leaves has $2^k - 1$ vertices including
leaves and that each path from a leaf to the root has $k - 1$ edges and $k$ vertices.

THE PEBBLE GAME 

10.9 Consider the circuit shown in Fig. 2.15. Treat each gate and each input vertex as a 
vertex. Give a good pebbling strategy for this graph. 

10.10 Give a pebbling strategy for the $m$-input counting circuit in Fig. 2.21(b) that uses
$O(\log m)$ pebbles and $O(m)$ steps. Determine the minimum number of pebbles
with which the circuit can be pebbled. Determine the number of steps needed with
this minimal pebbling.
SPACE LOWER BOUNDS WITH PEBBLING

10.11 Consider the FFT graph $F^{(d)}$ on $m = 2^d$ inputs. Show that the subgraph connecting
inputs to any one output is a complete binary tree on $m$ leaves.

10.12 Consider a directed acyclic graph with n vertices, some of which have out-degree greater 
than 2. (a) Show that if each vertex of out-degree k > 2 is replaced by a binary tree 
with k leaves and edges directed from the root to the leaves, the number of vertices in 
the graph is at most doubled, (b) Show that replacing vertices with in-degree greater 
than 2 with binary trees also at most doubles the number of vertices in the graph. 

EXTREME TRADEOFFS WITH PEBBLING 

10.13 Let $N(k)$ be the number of vertices in the graph $H_k$ discussed in Section 10.3. Show
that the following recurrence holds for $N(k)$:

$$N(k) = N(k-1) + 4k + 3$$

Show that $N(k) = 2k^2 + 5k - 6$ for $k \ge 2$ since $N(2) = 12$.

10.14 Construct a new family $\{G_k\}$ of graphs with fan-in 2 at each vertex from the graphs
$\{H_k\}$ by replacing the tree in Fig. 10.4 by a pyramid graph in $k$ inputs and the bipartite
graph with the graph $E_k$ shown in Fig. 10.23. Show that each output of $E_k$ can be
pebbled with $k$ pebbles but that after pebbling any one output there is at least one path
without pebbles between the input and every other output. Show also that with $k + 1$
pebbles $E_k$ can be pebbled without repebbling any vertex.

Let $T_k(S)$ be the number of steps to pebble $G_k$ with $S$ pebbles. Using the above facts,
show the following:

a) $N(k) = |G_k| = O(n^4)$

b) $S_{\min}(G_k) = k$

c) $T_k(k+1) = N(k)$

d) $T_k(k) = 2^{\Omega(N(k)^{1/4} \log N(k))}$



Figure 10.23 The graph $E_k$ used in the construction of the family $\{G_k\}$.
SPACE-TIME LOWER BOUNDS WITH PEBBLING

10.15 Let $A$ be a $\gamma$-nice $n \times n$ matrix over a ring $\mathcal{R}$ for some $0 < \gamma < 1/2$. Show that the
matrix-vector multiplication function $f^{(n)}_{A \times x} : \mathcal{R}^n \mapsto \mathcal{R}^n$ that maps the input $n$-tuple
$x$ to the output $n$-tuple $Ax$ is $(1, n^2 + n, n, \gamma n)$-independent.

10.16 Use Lemma 10.12.1 and the result of the previous problem to show that for almost
all $n \times n$ matrices $A$ every straight-line program for the matrix-vector multiplication
function $f^{(n)}_{A \times x} : \mathcal{R}^n \mapsto \mathcal{R}^n$ over the ring $\mathcal{R}$ requires space $S$ and time $T$ satisfying
the inequality

$$(S+1)T = \Omega(n^2)$$

Furthermore, show that a straight-line program for matrix-vector multiplication can be
realized with space $S = 3$ and time $T = n(2n - 1)$, that is, with

$$(S+1)T = O(n^2)$$

10.17 Linear systems are described in Section 6.2.2. A linear system of $n$ equations in $n$
unknowns $x$ is defined by an $(n \times n)$-coefficient matrix $A$ and an $n$-vector $b$, as
suggested below:

$$Ax = b \qquad (10.11)$$

The goal is to solve this equation for $x$. If $A$ is non-singular, such a solution exists for
each vector $b$. Let $f^{(n)}_{A^{-1} \times b} : \mathcal{R}^{n^2+n} \mapsto \mathcal{R}^n$ denote the linear system solver function
that maps the matrix $A$ and the vector $b$ onto the solution $x$ when the matrix-vector
multiplication is over the ring $\mathcal{R}$ and $A$ is non-singular.
Show that every pebbling strategy for every straight-line program to compute the linear
system solver function $f^{(n)}_{A^{-1} \times b} : \mathcal{R}^{n^2+n} \mapsto \mathcal{R}^n$ over the ring $\mathcal{R}$ for $n$ even requires space
$S$ and time $T$ satisfying the following inequality:

$$(S+1)T \ge n^3/24$$

Hint: Would it be possible to violate the lower bound on $(S+1)T$ for matrix inversion
given in Problem 10.25 if a DAG for the linear system solver function can be pebbled
with $S$ pebbles in too few steps?

10.18 Let $f : \mathcal{A}^n \mapsto \mathcal{A}^m$ have $g : \mathcal{A}^r \mapsto \mathcal{A}^s$ as a subfunction. Show that if $g$ is $(\alpha, r, s, p)$-
independent for $r \le n$ and $s \le m$, then so is $f$. Show that, as a consequence, the
space $S$ and time $T$ needed to pebble the graph of a straight-line program for $f$ satisfy
the following inequality:

$$\lceil \alpha(S+1) \rceil\, T \ge sp/4$$

10.19 Show that if a function is $(\alpha, n, m, p)$-independent, it is also $(\alpha, n, m, q)$-independent
for $q \le p$.

Hint: Consider the same set V of outputs in the two definitions. 
10.20 A finite-state machine $M$ computes the function $f^{(n)}_M : Q \times \Sigma^n \mapsto \Psi^n$ that maps
the initial state in $Q$ and an input string $x$ of length $n$ over the input alphabet $\Sigma$ onto
an output string $y$ of the same length over the output alphabet $\Psi$. Such a machine
can compute a function $f : \mathcal{A}^n \mapsto \mathcal{A}^m$ by associating inputs and outputs of $f$ with
inputs and outputs of $f^{(n)}_M$. A computation of an FSM $M$ of a function $f$ is input-
output oblivious if the times at which inputs of $f$ are read and its outputs produced
are independent of the value of its input variables.

Show that Theorem 10.4.1 can be generalized from straight-line computations to
computations by input-output-oblivious FSMs.

Hint: Try to parallel the proof of Theorem 10.4.1 using the FSM $M$ instead of the
pebble game. What correspondence can you make between the values under pebbles
before the interval $X$ and the state of $M$? Let $\log_2 |Q|$, where $Q$ is the set of states of
$M$, be the measure of space associated with it.

10.21 Give a design of an FSM that computes a function / from straight-line programs for it 
using a number of steps and storage locations proportional to the time and space used 
by a pebbling strategy for this straight-line program. 

Hint: Design the FSM so that it receives the inputs provided to the pebbling strategy 
as well as instructions to specify which operations are performed on the inputs and 
temporary storage locations of the FSM. 

TRANSITIVE FUNCTIONS 

10.22 Many functions for which space-time lower bounds have been derived are transitive. 
Such functions have the property that for subsets X and Y of their inputs and outputs, 
respectively, |X| = \Y\ = n, the (control) inputs not in X can be chosen so as to cause 
the outputs in Y to be equal to an arbitrary permutation drawn from the set G(n) 
of the inputs in X. For example, the cyclic shifting function studied in Section 2.5.2 
has a set of control inputs that specify the amount by which value inputs are permuted 
cyclically and assigned to the output variables. 

DEFINITION 10.13.2 Let $G(n)$ be a group of permutations of the integers $N(n) = \{0,
1, 2, \ldots, n-1\}$. That is, if $\pi$ is in $G(n)$, then $\pi : N(n) \mapsto N(n)$. We denote by $\pi(i)$
the integer to which integer $i$ is mapped by $\pi$. A function $f_{G(n)} : \mathcal{A}^{n+s} \mapsto \mathcal{A}^n$, where
$(y_{n-1}, \ldots, y_1, y_0) = f_{G(n)}(x_{n-1}, \ldots, x_1, x_0, c_{s-1}, \ldots, c_0)$, is said to have value
inputs $x_{n-1}, \ldots, x_1, x_0$, control inputs $c_{s-1}, \ldots, c_0$, and outputs $y_{n-1}, \ldots, y_1, y_0$.
Such a function is transitive of order $n$ with respect to the group $G(n)$ if

a) For each $0 \le i \le n-1$ and $0 \le j \le n-1$, there exists a permutation $\pi \in G(n)$
such that $\pi(i) = j$, and

b) For each $\pi \in G(n)$, there is an assignment to $c_{s-1}, \ldots, c_0$ such that $y_{\pi(i)} = x_i$ for
$0 \le i \le n-1$.

Show that every transitive function of order $n$ with respect to the permutation group
$G(n)$, $f_{G(n)} : \mathcal{A}^{n+s} \mapsto \mathcal{A}^n$, is $(2, n+s, n, n/2)$-independent.

10.23 Show that the cyclic shifting function $f^{(n)}_{\mathrm{cyclic}} : \mathcal{B}^{n + \lceil \log n \rceil} \mapsto \mathcal{B}^n$ defined in
Section 2.5.2 is transitive of order $n$.
10.24 Consider the function $f^{(n)}_{PAQ} : \mathcal{R}^{3n^2} \mapsto \mathcal{R}^{n^2}$ whose value is the product $PAQ$ of three
$n \times n$ matrices $P$, $A$, and $Q$. Let $P$ and $Q$ be permutation matrices whose entries serve
as control inputs. Show that $f^{(n)}_{PAQ}$ is transitive of order $n^2$.

10.25 The matrix inversion function $f^{(n)}_{M^{-1}} : \mathcal{R}^{n^2} \mapsto \mathcal{R}^{n^2}$ maps a non-singular $n \times n$ matrix
over the ring $\mathcal{R}$ to its inverse. (See Section 6.3.) Show that $f^{(n)}_{M^{-1}}$ is $(2, n^2, n, n/2)$-
independent.

Hint: Show that $f^{(n)}_{M^{-1}}$ contains as a subfunction the function $f^{(n)}_{PAQ}$
defined in Problem 10.24. In this connection consider the following identity, which
holds when the $n \times n$ matrices $R$ and $S$ are non-singular:

$$M = \begin{bmatrix} R & A \\ 0 & S \end{bmatrix}, \qquad M^{-1} = \begin{bmatrix} R^{-1} & -R^{-1} A S^{-1} \\ 0 & S^{-1} \end{bmatrix}$$



PEBBLING SUPERCONCENTRATORS 

10.26 Show that the graph consisting of two $n = 2^d$-input FFT graphs connected back
to back (as shown in Fig. 10.24 with the second FFT graph reversed) is a
superconcentrator. (Valiant [343] has shown the existence of $n$-superconcentrators with $O(n)$
vertices.)

Hint: Reason that there are unique vertex-disjoint paths from any r input vertices of 
this graph to any r consecutive vertices that are simultaneously outputs of the first 
FFT graph and the inputs to the reversed FFT graph. The first and last vertices are 
consecutive. 

10.27 Prove that to pebble any $S+1$ outputs of an $n$-superconcentrator, $S + 1 \le n$, from an
initial placement of $S$ pebbles requires that at least $n - S$ different inputs be pebbled.
Hint: Suppose that at most $n - (S+1)$ inputs are pebbled from an initial placement
of $S$ pebbles to pebble $S+1$ outputs. Can you reason from the superconcentration
property that $S + 1$ or more inputs cannot remain unpebbled since $S+1$ outputs are
pebbled?

Figure 10.24 Two back-to-back FFT graphs form a superconcentrator.

10.28 Use the result of the previous problem to show that to pebble an n-superconcentrator 
with S pebbles in time T requires S and T to satisfy the following inequality: 

n 2 
(5+1)T >_ 

Hint: As in the proof of Theorem 10.4.1, divide time up into consecutive intervals. 
Choose the intervals so that each has the same number of outputs pebbled during it. 
Apply the results of the previous problem to obtain a lower bound on the sum of the 
number of input and output vertices that are pebbled during the interval. 

10.29 Show that the pebbling of two $n$-input back-to-back FFT graphs requires space and
time that satisfy $ST = \Omega(n^2)$ and that this lower bound can be achieved up to a
multiplicative factor.

Hint: From the proof of Lemma 10.5.4 it follows that to pebble any $2S$ outputs with
$S$ pebbles at least $n - S + 1$ inputs must be pebbled because if fewer inputs need be
pebbled the outputs can have more values than is possible for the FFT.



APPLICATIONS OF THE GRIGORIEV LOWER BOUND 

10.30 Show that there is a pebbling for a straight-line program for the cyclic shift
function $f^{(n)}_{\mathrm{cyclic}} : \mathcal{B}^{n + \lceil \log n \rceil} \mapsto \mathcal{B}^n$ examined in Section 10.5.2 for which $(S+1)T =
O(n^2 \log n)$.

Hint: Pebble the graph of the circuit described in Section 2.5.1. Construct a circuit for
$f^{(n)}_{\mathrm{cyclic}}$ that produces each output with $O(n \log n)$ gates.

10.31 Show that the binary addition function $f^{(n)}_{\mathrm{add}}$ (see Section 2.7) can be realized by a
straight-line program using space and time satisfying $ST = O(n)$.

10.32 Derive upper and lower bounds on the product $(S+1)T$ for pebblings of circuits for
the squaring function $f^{(n)}_{\mathrm{square}}$ that are within a factor of $O(\log n)$ of one another.

10.33 Derive good upper and lower bounds on the product $(S+1)T$ for pebblings of circuits
for the reciprocal function $f^{(n)}_{\mathrm{recip}}$.

10.34 In Section 6.5.3 a straight-line algorithm is given to invert an $n \times n$ triangular matrix.
Construct another straight-line algorithm based on it that can be pebbled with $O(n)$
pebbles to produce outputs by columns in $O(n^3)$ steps under the assumption that the
standard matrix multiplication algorithm is used for the matrix multiplication steps.
Hint: To produce outputs of a triangular matrix $T$ by columns using the algorithm of
Fig. 6.5, it is necessary to read the elements of $T_{21}$ by rows and produce the outputs of
$T_{22}^{-1}$ by rows. Consider modifying this algorithm to generate the elements of the latter
matrix first by rows and then by columns.
BRANCHING PROGRAMS

10.35 Give a proof of Lemma 10.9.1 by a) designing a general branching program to simulate 
a comparison operator and b) using this design in a complete branching program that 
simulates a decision branching program. 

10.36 In Section 10.9 a procedure is given to convert a general branching program to a tree 
program without increasing the length of any path. Use this fact to show that every 
decision branching program with queries {<,=} that sorts a list of n items requires 
worst-case time of at least (n/2) log(n/2) when n is even. Show that this lower bound 
can be achieved up to a constant multiplicative factor. 

Hint: Show that every binary tree with m leaves must have a longest path of length 
at least log 2 m and determine the number of distinct leaves necessary in every decision 
branching program for sorting. 

THE BORODIN-COOK LOWER-BOUND METHOD 

10.37 The computation time of a branching program is the length of the longest path in its 
directed acyclic multigraph. Assume that a probability is assigned to each input x of 
length n. The average computation time, T, of a branching program is the sum of 
the lengths of the paths associated with different inputs weighted by the probabilities of 
these inputs. To compute the average space of a branching program with k vertices, the 
integers in the set {1,2, . . . , k} are assigned to the vertices of the branching program. 
The space associated with input x is the base-2 logarithm of the largest such integer 
encountered during the computation associated with x. The average space associated 
with a numbering of vertices is the average of this logarithm. The average space, S, 
associated with a branching program is the smallest average space over all numberings 
of vertices. 

Given a probability distribution on inputs of length n, let Cf(a, b) denote the maxi- 
mum over all those tree branching programs of depth a of the probability that b of the 
m outputs of the function / are computed correctly. Show that Theorem 10.11.1 can 
be generalized to the above probabilistic setting. 

Hint: If T is the average time of the branching program P, truncate the branching 
program at depth XT , call the new program P* , and show that P* solves the problem 
solved by P with probability at least 1/2. Also, show that with probability at least 1/2 
there exists a rich path in some stage that produces b = \mj o\ outputs. Let pi be 
the probability that the subtree with root i in some stage correctly produces b outputs. 
Now develop an upper bound in terms of the pi on the probability that some tree in 
some stage correctly produces b outputs. 

APPLICATIONS OF THE BORODIN-COOK LOWER BOUND 

10.38 Show that the branching program in Fig. 10.20 computes the inner product of two
3-element sequences over the set of integers modulo 2; that is, the integers $\{0, 1\}$ with
the EXCLUSIVE-OR function for addition and the AND function for multiplication.

10.39 Complete the proof of Theorem 10.13.2 by filling in the details of the construction of 
a branching program for integer multiplication for the middle range of space. 
10.40 Complete the proof of Theorem 10.13.4 by showing that two $n \times n$ matrices can
be multiplied with a hybrid algorithm that combines table lookup with the standard 
matrix multiplication algorithm on k x k blocks to achieve space and time satisfying 

ST 2 = 0{n i log \H\) 

10.41 Show that the RAM program described in Fig. 10.22 can be converted to a branching
program of space $O(S)$ and time $O(T)$.



Chapter Notes 

The first formal study of space-time tradeoffs was made by Cobham [73] . He considered 
computations on one-tape Turing machines using as a space measure the logarithm of the 
number of configurations, and obtained quadratic lower bounds on the space-time product to 
recognize strings representing palindromes and perfect squares. 

The pebble-game model was implicitly used by Paterson and Hewitt [239] to study pro- 
gram schemas, uninterpreted graphs representing programs. They derived the space lower 
bound of Lemma 10.2.1, thereby demonstrating that recursive programs are more power- 
ful than nonrecursive ones. Cook [75,79] asked how much space (how many pebbles) was 
needed to execute a program schema with n vertices and obtained the result for pyramids of 
Lemma 10.2.2, showing that the minimum space is at least $O(\sqrt{n})$ for some schemas. The
minimum-space question was answered by Hopcroft, Paul, and Valiant [140], who proved 
Theorem 10.7.1, and Paul, Tarjan, and Celoni [246], who obtained Theorem 10.8.1. The 
pebble model first formally appeared in [140]. Gilbert, Lengauer, and Tarjan [115] and Loui 
[205] have shown that the languages associated with minimal pebblings of DAGs (described 
at the end of Section 10.2) are PSPACE-complete. 

In addition to studying the minimum space needed for a computation, researchers also 
examined tradeoffs between space and time. Paterson and Hewitt [239] studied the conversion 
of a linear recursive program schema into a non-recursive one and demonstrated that the time 
needed satisfies T = 0(n 1+1 ' ^ ') for S > 2. (See Chandra [66] and Swamy and Savage 
[321] for more details on this problem.)

A number of other authors have identified graphs exhibiting non-trivial exchanges of space
for time. Pippenger [254] gave a graph on $n$ vertices for which $T = \Omega(n \log \log n)$ when
$S = O(n/\log n)$, and Savage and Swamy [293] demonstrated that the FFT graph requires $S$
and $T$ satisfying $ST = \Theta(n^2)$. (This is the first tradeoff result for a natural algorithm. Their
upper bound is given in Theorem 10.5.5.) Later Tompa [333] and Reischuk [279] exhibited
graphs requiring $T = \Omega(n \log n)$ and $T = \Omega(n \log^t n)$ for any integer $t$, respectively, when
$S = \Theta(n/\log n)$.

Paul and Tarjan [245], Lingas [201], and van Emde Boas and van Leeuwen [349] gave 
graphs with T increasing from O(n) to T = 2 n ^ 1/2 \ T = 2 n ("' /3 ), and T = 2 n{ - nU ' lo & n \ 
respectively, when S drops by a constant amount from S = 0(n ' ), S = 0(n ' ) and 
S = 0(n 1 ' ), respectively. Theorem 10.3.1 is from [349], as is Problem 10.14. Carl- 
son and Savage [64] took a different tack and exhibited graphs for which T is superlinear, 
namely, T = 2 n ( lo s™ lo s lo s™) over a range of values of S, namely, O(logn) < S < 
0(n ' /logn). References to the worst-case exchange of space for time are given in Sec- 
tion 10.6. 
Grigoriev [121] gave the first space-time lower bounds that apply to all graphs for a
problem (see Corollary 10.4.1), the essential idea of which is generalized in Theorem 10.4.1. Savage
[291] introduced the $w(u, v)$-flow measure used in this version of a theorem to derive lower
bounds on area-time tradeoffs for VLSI algorithms. Grigoriev [121] also established
Theorem 10.4.2 and derived a tradeoff lower bound on polynomial multiplication that is
equivalent to Theorem 10.5.1 on convolution. The improved version of Theorem 10.4.2, namely
Theorem 10.5.4, is original with this book.

Lower bounds using the Grigoriev approach explicitly require that the sets over which 
functions are defined be finite. Tompa [331,332] eliminated the requirement for finite sets but 
required instead that functions be linear. Using concentrator properties of matrices deduced 
by Valiant [343], Tompa derived a lower bound on $ST$ for superconcentrators that he applied
to matrix-vector multiplication and polynomial multiplication. He developed a similar lower
bound for the DFT. (See Abelson [2] for a generalization of some of these results to continuous
functions.) The lower bound of Theorem 10.5.5 uses Tompa's DFT proof but does not require 
that straight-line programs be linear. 

The result on cyclic shift (Theorem 10.5.2) is due to Savage [292]. (This paper also gener- 
alizes Grigoriev's model to I/O-oblivious FSMs, extends Jaja's [147] space-time lower bound 
for matrix inversion, and derives space-time lower bounds for transitive functions and banded 
matrices.) The result on integer multiplication (Theorem 10.5.3) is due to Savage and Swamy 
[294]. In [331] Tompa also obtained Theorem 10.5.6 on merging. Transitive functions, defined
in Problem 10.22, were introduced by Vuillemin [355].

In [333] Tompa examined the graph associated with the algorithm for transitive closure 
based on successive squarings described in Section 6.4 and demonstrated that it can be pebbled
either in a polynomial number of steps or with small space, namely O(log n), but not
both. Carlson [61] demonstrated that algorithms for convolution based on FFT graphs (see
Section 6.7.4) require that $T = \Theta(n^3/S^2 + n^2(\log n)/S)$, which doesn't come close to
matching the lower bound of Theorem 10.5.1. However, through the judicious replacement
of back-to-back FFT subgraphs in the standard convolution algorithm, Carlson [62] was able
to achieve the bounds $T = \Theta(n \log S + n^2(\log S)/S)$, which are optimal over all FFT-based
convolution algorithms and nearly as good as the $T = \Omega(n^2/S)$ bounds. (See also [63].)
Carlson and Savage [65] explored for a number of problems the size of the smallest graphs that 
can be pebbled with a small number of pebbles and demonstrated a tradeoff between size and 
space. 

Pippenger [251] has surveyed many of the results described above as well as those on the 
black-white pebble game described below. 

Several extensions of the pebble game have been developed. One of these is the red-blue 
pebble game discussed in Chapter 11 and its generalization, the memory hierarchy game.
Another is the black-white pebble game whose rules are the following: a) a black pebble can be 
placed on an input vertex at any time and on a non-input vertex only if its predecessors carry 
pebbles, whether white or black; b) a black pebble may be removed at any time; c) a white 
pebble can be placed on a vertex at any time; d) a white pebble can be removed only if all its 
predecessors carry pebbles. The placement of white pebbles models a non-deterministic guess. 
The removal of a white pebble is allowed only when the guess has been verified. Two questions
raised by this game are whether the minimum space required for a graph is lower with
the black-white pebble game than with the standard game and whether for a given amount of 
space, the time required is lower. The black-white game was introduced by Cook and Sethi 
[78], who showed that the minimum space for the pyramid graph is at least $\sqrt{N/2} - 1$. Meyer
auf der Heide [222] proved that this minimum space is at most $\lceil n/2 \rceil + 2$ and established in
general that any graph with minimum space n in the black-white game has minimum space at
most $(n^2 - n)/2 + 1$ in the standard game. The latter result is the pebbling analog of Savitch's
theorem (Theorem 8.5.5). 

Loui [206] and Meyer auf der Heide [222] have shown that the minimum space with the 
black-white game is at least one half that for the standard pebble game for balanced trees, a 
result extended by Lengauer and Tarjan [196] to all trees and then by Klawe [167]. Wilber 
[363] has exhibited an infinite family of graphs for which the black-white minimum space is 
smaller than the minimum space with the standard game by more than a constant factor. 

All of the pebble games mentioned above are one-person games; that is, one person plays 
the game. A two-person game introduced by Venkateswaran and Tompa [352] models parallel 
complexity classes. Savage and Vitter [296] have also introduced a model of parallel pebbling. 

Branching programs have been known as binary decision diagrams for at least 30 years 
[15], although their importance to CAD was recognized only in the last 10 or 12 years. (See
[60].) Branching programs were proposed as a vehicle for studying space-time problems by
Pippenger and first studied by Tompa [331], who cites Pippenger for Lemma 10.9.2. Borodin, 
Fischer, Kirkpatrick, Lynch, and Tompa [55] derived a lower bound of $ST = \Omega(n^2)$ to
sort n items with decision branching programs. Borodin and Cook [53] formulated the same 
problem in terms of the general branching programs of Section 10.9 and developed the general 
framework used in Theorem 10.11.1. 

Yesha [370] developed lower bounds on the space-time product with branching programs
for the discrete Fourier transform (see Theorem 10.13.7) and matrix multiplication over
restricted domains. Abrahamson [6] (see also [4]) derived the lower bound on $ST^2$ in
Theorem 10.13.4, thereby improving upon the matrix multiplication bound of Yesha. He also
extended the Borodin-Cook model to probabilistic branching programs (see Problem 10.37) 
and derived the lower bound on ST for convolution (Theorem 10.13.1), integer multiplication
(Theorem 10.13.2), matrix-vector multiplication (Theorem 10.13.3), and matrix inversion
(Theorem 10.13.6). He also developed a lower bound of $\Omega(n^3)$ on ST to compute the
product PAQ of three $n \times n$ matrices, where P and Q are permutation matrices. Abrahamson
has also studied Boolean matrix multiplication in the general branching program model [5]. 
Beame [34] has obtained the result of Theorem 10.13.8 showing that the unique elements 
problem requires that $ST = \Omega(n^2)$ for general branching programs, which implies the lower
bound on sorting stated in Theorem 10.13.9. 

In the comparison-based branching program model, Borodin, Fich, Meyer auf der Heide, 
Upfal, and Wigderson [54] derive the lower bound $ST = \Omega(n^{3/2}\sqrt{\log n})$ for the element-
distinctness problem on n inputs. For the same computational model, Yao [369] improved
this to $ST = \Omega(n^{2-\epsilon(n)})$, where $\epsilon(n)$ is a decreasing function of n.




CHAPTER 11

Memory-Hierarchy Tradeoffs



Although serial programming languages assume that programs are written for the RAM model, 
this model is rarely implemented in practice. Instead, the random-access memory is replaced 
with a hierarchy of memory units of increasing size, decreasing cost per bit, and increasing 
access time. In this chapter we study the conditions on the size and speed of these units when 
a CPU and a memory hierarchy simulate the RAM model. The design of memory hierarchies 
is a topic in operating systems. 

A memory hierarchy typically contains the local registers of the CPU at the lowest level and 
may contain at succeeding levels a small, very fast, local random-access memory called a cache, 
a slower but still fast random-access memory, and a large but slow disk. The time to move data 
between levels in a memory hierarchy is typically a few CPU cycles at the cache level, tens of 
cycles at the level of a random-access memory, and hundreds of thousands of cycles at the disk 
level! A CPU that accesses a random-access memory on every CPU cycle may run at about 
a tenth of its maximum speed, and the situation can be dramatically worse if the CPU must 
access the disk frequently. Thus it is highly desirable to understand for a given problem how 
the number of data movements between levels in a hierarchy depends on the storage capacity 
of each memory unit in that hierarchy. 

In this chapter we study tradeoffs between the number of storage locations (space) at each 
memory-hierarchy level and the number of data movements (I/O time) between levels. Two 
closely related models of memory hierarchies are used, the memory-hierarchy pebble game and 
the hierarchical memory model, which are extensions of those introduced in Chapter 10. 

In most of this chapter it is assumed not only that the user has control over the I/O algo- 
rithm used for a problem but that the operating system does not interfere with the I/O oper- 
ations requested by the user. However, we also examine I/O performance when the operating 
system, not the user, controls the sequence of memory accesses (Section 11.10). Competi- 
tive analysis is used in this case to evaluate two-level LRU and FIFO memory-management 
algorithms. 
11.1 The Red-Blue Pebble Game

The red-blue pebble game models data movement between adjacent levels of a two-level mem- 
ory hierarchy. We begin with this model to fix ideas and then introduce the more general 
memory-hierarchy game. Both games are played on a directed acyclic graph, the graph of a 
straight-line program. We describe the game and then give its rules. 

In the red-blue game, (hot) red pebbles identify values held in a fast primary memory 
whereas (cold) blue pebbles identify values held in a secondary memory. The values identified 
with the pebbles can be words or blocks of words, such as the pages used by an operating 
system. Since the red-blue pebble game is used to study the number of I/O operations necessary 
for a problem, the number of red pebbles is assumed limited and the number of blue pebbles is 
assumed unlimited. Before the game starts, blue pebbles reside on all input vertices. The goal 
is to place a blue pebble on each output vertex, that is, to compute the values associated with 
these vertices and place them in long-term storage. These assumptions capture the idea that 
data resides initially in the most remote memory unit and the results must be deposited there. 

RED-BLUE PEBBLE GAME 

• (Initialization) A blue pebble can be placed on an input vertex at any time. 

• (Computation Step) A red pebble can be placed on (or moved to) a vertex if all its imme- 
diate predecessors carry red pebbles. 



• (Pebble Deletion) A pebble can be deleted from any vertex at any time.

• (Goal) A blue pebble must reside on each output vertex at the end of the game.

• (Input from Blue Level) A red pebble can be placed on any vertex carrying a blue pebble. 

• (Output to Blue Level) A blue pebble can be placed on any vertex carrying a red pebble. 

The first rule (initialization) models the retrieval of input data from the secondary mem- 
ory. The second rule (a computation step) is equivalent to requiring that all the arguments 
on which a function depends reside in primary memory before the function can be computed. 
This rule also allows a pebble to move (or slide) to a vertex from one of its predecessors, mod- 
eling the use of a register as both the source and target of an operation. The third rule allows 
pebble deletion: if a red pebble is removed from a vertex that later needs a red pebble, it must 
be repebbled. 

The fourth rule (the goal) models the placement of output data in the secondary memory 
at the end of a computation. The fifth rule allows data held in the secondary memory to be 
moved back to the primary memory (an input operation). The sixth rule allows a result to 
be copied to a secondary memory of unlimited capacity (an output operation). Note that a 
result may be in both memories at the same time. 

The red-blue pebble game is a direct generalization of the pebble game of Section 10.1 
(which we call the red pebble game), as can be seen by restricting the sixth rule to allow 
the placement of blue pebbles only on vertices that are output vertices of the DAG. Under 
this restriction the blue level cannot be used for intermediate results and the goal of the game 
becomes to minimize the number of times vertices are pebbled with red pebbles, since the 
optimal strategy pebbles each output vertex once. 
A pebbling strategy $\mathcal{P}$ is the execution of the rules of the pebble game on the vertices of
a graph. We assign a step to each placement of a pebble, ignoring steps on which pebbles are
removed, and number the steps consecutively. The space used by a strategy $\mathcal{P}$ is defined as
the maximum number of red pebbles it uses. The I/O time, $T_2^{(2)}$, of $\mathcal{P}$ on the graph G is the
number of input and output (I/O) steps used by $\mathcal{P}$. The computation time, $T_1^{(2)}$, is the number
of computation steps of $\mathcal{P}$ on G. Note that time in the red pebble game is the time to place red
pebbles on input and internal vertices; in this chapter the former are called I/O operations.
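
The rules and the three cost measures can be made concrete with a small replay routine. The following is a minimal Python sketch, not part of the original text: the move encoding (`"input"`, `"output"`, `"compute"`, `"slide"`, `"delete_red"`, `"delete_blue"`) and the function name `play_red_blue` are illustrative assumptions. It assumes blue pebbles already reside on all input vertices and that input vertices receive red pebbles only through an input operation.

```python
from typing import Dict, List, Sequence, Set, Tuple


def play_red_blue(preds: Dict[str, List[str]],
                  outputs: Sequence[str],
                  moves: Sequence[Tuple[str, ...]],
                  S: int) -> Tuple[int, int, int]:
    """Replay a strategy and return (computation steps, I/O steps, peak red pebbles).

    preds maps each vertex to its immediate predecessors; inputs have none.
    Raises ValueError as soon as a move breaks a rule or uses more than S red pebbles.
    """
    red: Set[str] = set()
    # Assumption: before the game starts, blue pebbles reside on all inputs.
    blue: Set[str] = {v for v, p in preds.items() if not p}
    t1 = t2 = peak = 0
    for move in moves:
        op, args = move[0], move[1:]
        if op == "input":            # red pebble onto a blue-pebbled vertex
            (v,) = args
            if v not in blue:
                raise ValueError(f"{v} carries no blue pebble")
            red.add(v)
            t2 += 1
        elif op == "output":         # blue pebble onto a red-pebbled vertex
            (v,) = args
            if v not in red:
                raise ValueError(f"{v} carries no red pebble")
            blue.add(v)
            t2 += 1
        elif op in ("compute", "slide"):
            v = args[-1]
            if not preds[v] or any(u not in red for u in preds[v]):
                raise ValueError(f"predecessors of {v} are not all red")
            if op == "slide":        # ("slide", src, v): move the pebble on src to v
                src = args[0]
                if src not in preds[v]:
                    raise ValueError(f"{src} is not a predecessor of {v}")
                red.remove(src)
            red.add(v)
            t1 += 1
        elif op == "delete_red":
            red.discard(args[0])
        elif op == "delete_blue":
            blue.discard(args[0])
        else:
            raise ValueError(f"unknown move {op!r}")
        if len(red) > S:
            raise ValueError("more than S red pebbles in use")
        peak = max(peak, len(red))
    if any(v not in blue for v in outputs):
        raise ValueError("an output vertex lacks a blue pebble")
    return t1, t2, peak
```

For a single two-input vertex `c` with inputs `a` and `b`, the moves `input a`, `input b`, `slide a -> c`, `output c` use two red pebbles, one computation step, and three I/O steps.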

Since accesses to secondary memory are assumed to require much more time than accesses
to primary memory, a minimal pebbling strategy, $\mathcal{P}_{\min}$, performs the minimal number of
I/O operations on a graph G for a given number of red pebbles and uses the smallest number
of red pebbles for a given I/O time. Furthermore, such a strategy also uses the smallest number
of computation steps among those meeting the other requirements. We denote by $T_1^{(2)}(S, G)$
and $T_2^{(2)}(S, G)$ the number of computation and I/O steps in a minimal pebbling of G in the
red-blue pebble game with S red pebbles.

The minimum number of red pebbles needed to play the red-blue pebble game is the 
maximum number of predecessors of any vertex. This follows because blue pebbles can be used 
to hold all intermediate results. Thus, in the FFT graph of Fig. 11.1 only two red pebbles are 
needed, since one of them can be slid to the vertex being pebbled. However, if the minimum 
number of pebbles is used, many expensive I/O operations are necessary. 
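
To see how expensive the minimum-pebble regime is, consider one simple strategy, assumed here purely for illustration: for each non-input vertex, read all of its predecessors from the blue level, slide one red pebble onto the vertex, write the result to the blue level, and delete the remaining red pebbles. The Python sketch below builds an n-input FFT graph (the butterfly wiring by stride is one standard labeling and is an assumption, as are the function names) and counts the costs of that strategy.

```python
from typing import Dict, List, Tuple

Vertex = Tuple[int, int]  # (level, index); level 0 holds the inputs


def fft_graph(n: int) -> Dict[Vertex, List[Vertex]]:
    """Predecessor lists of an n-input FFT (butterfly) graph, n a power of two."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two, n >= 2")
    levels = n.bit_length() - 1
    preds: Dict[Vertex, List[Vertex]] = {(0, i): [] for i in range(n)}
    for lev in range(1, levels + 1):
        stride = 1 << (lev - 1)
        for i in range(n):
            preds[(lev, i)] = [(lev - 1, i), (lev - 1, i ^ stride)]
    return preds


def naive_costs(preds: Dict[Vertex, List[Vertex]]) -> Tuple[int, int, int]:
    """Return (red pebbles, computation steps, I/O steps) of the naive strategy."""
    internal = [v for v, p in preds.items() if p]
    red_needed = max(len(preds[v]) for v in internal)   # maximum in-degree
    computation = len(internal)                         # one slide per vertex
    io = sum(len(preds[v]) + 1 for v in internal)       # reads plus one write
    return red_needed, computation, io
```

On the eight-input FFT graph this strategy uses two red pebbles and 24 computation steps but 72 I/O operations, three for every non-input vertex.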

In Section 11.2 we generalize the red-blue pebble game to multiple levels and consider two
variants of the model, one in which all levels including the highest can be used for intermediate 
storage, and a second in which the highest level cannot be used for intermediate storage. The 
second model (the I/O-limited game) captures aspects of the red-blue pebble game as well as 
the red pebble game of Chapter 10. 

An important distinction between the pebble game results obtained in this chapter and 
those in Chapter 10 is that here lower bounds are generally derived for particular graphs, 
whereas in Chapter 10 they are obtained for all graphs of a problem. 




Figure 11.1 An eight-input FFT graph showing three two-input FFT subgraphs.




11.1.1 Playing the Red-Blue Pebble Game 

The rules for the red-blue pebble game are illustrated by the eight-input FFT graph shown in 
Fig. 11.1. If S = 3 red pebbles are available to pebble this graph (at least S = 4 pebbles are 
needed in the one-pebble game), a pebbling strategy that keeps the number of I/O operations 
small is based on the pebbling of sub-FFT graphs on two inputs. Three such sub-FFT
graphs are shown by heavy lines in Fig. 11.1, one at each level of the FFT graph. This pebbling
strategy uses three red pebbles to place blue pebbles on the outputs of each of the four lowest- 
level sub-FFT graphs on two inputs, those whose outputs are second-level vertices of the full 
FFT graph. (Thus, eight blue pebbles are used.) Shown on a second-level sub-FFT graph are 
three red pebbles at the time when a pebble has just been placed on the first of the two outputs 
of this sub-FFT graph. This strategy performs two I/O operations for each vertex except for 
input and output vertices. A small savings is possible if, after pebbling the last sub-FFT graph 
at one level, we immediately pebble the last sub-FFT graph at the next level. 

11.1.2 Balanced Computer Systems 

A balanced computer system is one in which no computational unit or data channel becomes 
saturated before any other. The results in this chapter can be used to analyze balance. To 
illustrate this point, we examine a serial computer system consisting of a CPU with a random- 
access memory and a disk storage unit. Such a system is balanced for a particular problem if 
the time used for I/O is comparable to the time used for computation. 

As shown in Section 11.5.2, multiplying two $n \times n$ matrices with a variant of the classical
matrix multiplication algorithm requires a number of computations proportional to $n^3$ and a
number of I/O operations proportional to $n^3/\sqrt{S}$, where S is the number of red pebbles or
the capacity of the random-access memory. Let $t_0$ and $t_1$ be the times for one computation
and I/O operation, respectively. Then the system is balanced when $t_0 n^3 \approx t_1 n^3/\sqrt{S}$. Let the
computational and I/O capacities, $C_{comp}$ and $C_{I/O}$, be the rates at which the CPU and disk
can compute and exchange data, respectively; that is, $C_{comp} = 1/t_0$ and $C_{I/O} = 1/t_1$. Thus,
balance is achieved when the following condition holds:

$\frac{C_{comp}}{C_{I/O}} \approx \sqrt{S}$

From this condition we see that if through technological advance the ratio $C_{comp}/C_{I/O}$
increases by a factor $\beta$, then for the system to be balanced the storage capacity of the system, S,
must increase by a factor $\beta^2$.

Hennessy and Patterson [132, p. 427] observe that CPU speed is increasing between 50% 
and 100% per year while that of disks is increasing at a steady 7% per year. Thus, if the ratio 
$C_{comp}/C_{I/O}$ for our simple computer system grows by a factor of $50/7 \approx 7$ per year, then
S must grow by about a factor of 49 per year to maintain balance. To the extent that matrix 
multiplication is typical of the type of computing to be done and that computers have two- 
level memories, a crisis is looming in the computer industry! Fortunately, multi-level memory 
hierarchies are being introduced to help avoid this crisis. 

As bad as the situation is for matrix multiplication, it is much worse for the Fourier trans- 
form and sorting. For each of these problems the number of computation and I/O operations 
is proportional to $n \log_2 n$ and $n \log_2 n/\log_2 S$, respectively (see Section 11.5.3). Thus,
balance is achieved when

$\frac{C_{comp}}{C_{I/O}} \approx \log_2 S$

Consequently, if $C_{comp}/C_{I/O}$ increases by a factor $\beta$, S must increase to $S^{\beta}$. Under the
conditions given above, namely, $\beta \approx 7$, a balanced two-level memory-hierarchy system for
these problems must have a storage capacity that grows from S to about $S^7$ every year.
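
The two growth rules can be checked numerically. The Python sketch below takes the balance conditions of this section to be that the ratio of computational to I/O capacity is about $\sqrt{S}$ for matrix multiplication and about $\log_2 S$ for the FFT and sorting, with constants of proportionality ignored. The function names are illustrative assumptions.

```python
import math


def balanced_storage_matmul(ratio: float) -> float:
    """Storage S satisfying sqrt(S) = ratio, where ratio = C_comp / C_I/O."""
    return ratio ** 2


def balanced_storage_fft(ratio: float) -> float:
    """Storage S satisfying log2(S) = ratio, where ratio = C_comp / C_I/O."""
    return 2.0 ** ratio


if __name__ == "__main__":
    beta = 7.0   # yearly growth factor of the capacity ratio used in the text
    ratio = 3.0  # an arbitrary starting ratio, chosen only for illustration

    # Matrix multiplication: S grows by a factor beta**2 = 49.
    growth = balanced_storage_matmul(beta * ratio) / balanced_storage_matmul(ratio)
    print("matmul storage growth factor:", growth)

    # FFT and sorting: S grows to S**beta.
    before = balanced_storage_fft(ratio)
    after = balanced_storage_fft(beta * ratio)
    print("FFT storage:", before, "->", after,
          "equals S**beta:", math.isclose(after, before ** beta))
```

The contrast between a constant factor of 49 and raising S to the seventh power is what makes the FFT and sorting cases so much more demanding.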

11.2 The Memory-Hierarchy Pebble Game 

The standard memory-hierarchy game (MHG) defined below generalizes the two-level red-blue
game to multiple levels. The L-level MHG is played on directed acyclic graphs with $p_l$
pebbles at level $l$, $1 \le l \le L - 1$, and an unlimited number of pebbles at level L. When
L = 2, the lower level is the red level and the higher is the blue level. The number of pebbles
used at the $L - 1$ lowest levels is recorded in the resource vector $p = (p_1, p_2, \ldots, p_{L-1})$,
where $p_j \ge 1$ for $1 \le j \le L - 1$. The rules of the game are given below.

STANDARD MEMORY-HIERARCHY GAME 

R1. (Initialization) A level-L pebble can be placed on an input vertex at any time.

R2. (Computation Step) A first-level pebble can be placed on (or moved to) a vertex if all its
immediate predecessors carry first-level pebbles.

R3. (Pebble Deletion) A pebble of any level can be deleted from any vertex.

R4. (Goal) A level-L pebble must reside on each output vertex at the end of the game.

R5. (Input from Level $l$) For $2 \le l \le L$, a level-$(l - 1)$ pebble can be placed on any vertex
carrying a level-$l$ pebble.

R6. (Output to Level $l$) For $2 \le l \le L$, a level-$l$ pebble can be placed on any vertex carrying a
level-$(l - 1)$ pebble.

The first four rules are exactly as in the red-blue pebble game. The fifth and sixth rules generalize
the fifth and sixth rules of the red-blue pebble game by identifying inputs from and outputs
to level-$l$ memory. These last two rules allow a level-$l$ memory to serve as temporary storage
for lower-level memories.

In the standard MHG, the highest-level memory can be used for storing intermediate 
results. An important variant of the MHG is the I/O-limited memory-hierarchy game, in 
which the highest level memory cannot be used for intermediate storage. The rules of this 
game are the same as in the MHG except that rule R6 is replaced by the following two rules: 

I/O-LIMITED MEMORY-HIERARCHY GAME 

R6. (Output to Level $l$) For $2 \le l \le L - 1$, a level-$l$ pebble can be placed on any vertex
carrying a level-$(l - 1)$ pebble.

R7. (I/O Limitation) Level-L pebbles can only be placed on output vertices carrying
level-$(L - 1)$ pebbles.
The sixth and seventh rules of the new game allow the placement of level-L pebbles only on
output vertices. The two-level version of the I/O-limited MHG is the one-pebble game studied 
in Chapter 10. As mentioned earlier, we call the two-level I/O-limited MHG the red pebble 
game to distinguish it from the red-blue pebble game and the MHG. Clearly the multi-level 
I/O-limited MHG is a generalization of both the standard MHG and the one-pebble game. 

The I/O-limited MHG models the case in which accesses to the highest level memory take 
so long that it should be used only for archival storage, not intermediate storage. Today disks 
are so much slower than the other memories in a hierarchy that the I/O-limited MHG is the 
appropriate model when disks are used at the highest level. 

The resource vector $p = (p_1, p_2, \ldots, p_{L-1})$ associated with a pebbling strategy $\mathcal{P}$
specifies the number of level-$l$ pebbles, $p_l$, used by $\mathcal{P}$. We say that $p_l$ is the space used at level $l$ by
$\mathcal{P}$. We assume that $p_l \ge 1$ for $1 \le l \le L$, so that swapping between levels is possible. The
I/O time at level $l$ with pebbling strategy $\mathcal{P}$ and resource vector $p$, $T_l^{(L)}(p, G, \mathcal{P})$, $2 \le l \le L$,
with both versions of the MHG is the number of inputs from and outputs to level $l$. The
computation time with pebbling strategy $\mathcal{P}$ and resource vector $p$, $T_1^{(L)}(p, G, \mathcal{P})$, in the MHG
is the number of times first-level pebbles are placed on vertices by $\mathcal{P}$. Since there is little risk of
confusion, we use the same notation, $T_l^{(L)}(p, G, \mathcal{P})$, in the standard and I/O-limited MHG
for the number of computation and I/O steps.

The definition of a minimal MHG pebbling is similar to that for a red-blue pebbling. 
Given a resource vector $p$, $\mathcal{P}_{\min}$ is a minimal pebbling for an L-level MHG if it minimizes
the I/O time at level L, after which it minimizes the I/O time at level $L - 1$, continuing in
this fashion down to level 2. Among these strategies it must also minimize the computation 
time. This definition of minimality is used because we assume that the time needed to move 
data between levels of a memory hierarchy grows rapidly enough with increasing level that it is 
less costly to repebble vertices at or below a given level than to perform an I/O operation at a 
higher level. 
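
This notion of minimality amounts to a lexicographic comparison in which the level-L I/O time is the most significant component and the computation time the least. A minimal Python sketch follows, assuming each candidate strategy is summarized by its tuple of times $(T_1, T_2, \ldots, T_L)$. The function names and the sample numbers are hypothetical.

```python
from typing import Iterable, Tuple

Times = Tuple[int, ...]  # (T_1, T_2, ..., T_L): computation time, then I/O times


def minimality_key(times: Times) -> Times:
    """Order by T_L first, then T_{L-1}, ..., down to T_2, and finally T_1."""
    return tuple(reversed(times))


def pick_minimal(candidates: Iterable[Times]) -> Times:
    """Return the candidate that is minimal in the sense described above."""
    return min(candidates, key=minimality_key)


if __name__ == "__main__":
    # Three made-up strategies for a three-level game, given as (T_1, T_2, T_3).
    strategies = [(100, 40, 16), (60, 50, 16), (50, 30, 20)]
    print(pick_minimal(strategies))  # (100, 40, 16)
```

The first strategy is chosen even though it has the largest computation time, because it ties for the smallest level-3 I/O time and then has the smaller level-2 I/O time.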




Figure 11.2 Pebbling an eight-input FFT graph in the three-level MHG.

11.2.1 Playing the MHG

Figure 11.2 shows the FFT graph on eight inputs being pebbled in a three-level MHG with
resource vector p = (2, 4). Here black circles denote first-level pebbles, shaded circles denote 
second-level pebbles and striped circles denote third-level pebbles. Four striped, three shaded 
and two black pebbles reside on vertices in the second row of the FFT. One of these shaded 
second-level pebbles shares a vertex with a black first-level pebble, so that this black pebble can 
be moved to the vertex covered by the open circle without deleting all pebbles on the doubly 
covered vertex. 

To pebble the vertex under the open square with a black pebble, we reuse the black pebble 
on the open circle by swapping it with a fourth shaded pebble, after which we place the black 
pebble on the vertex that was doubly covered and then slide it to the vertex covered by the 
open box. This graph can be completely pebbled with the resource vector p = (2, 4) using 
only four third-level pebbles, as the reader is asked to show. (See Problem 11.3.) Thus, it can
also be pebbled in the four-level I/O-limited MHG using resource vector p = (2, 4, 4) . 



11.3 I/O-Time Relationships



The following simple relationships follow from two observations. First, each input and output 
vertex must receive a pebble at each level, since every input must be read from level L and 
every output must be written to level L. Second, at least one computation step is needed for 
each non-input vertex of the graph. Here we assume that every vertex in V must be pebbled 
to pebble the output vertices. 

LEMMA 11.3.1 Let $\alpha$ be the maximum in-degree of any vertex in G = (V, E) and let In(G)
and Out(G) be the sets of input and output vertices of G, respectively. Then any pebbling $\mathcal{P}$ of G
with the MHG, whether standard or I/O-limited, satisfies the following conditions for $2 \le l \le L$:

$T_l^{(L)}(p, G, \mathcal{P}) \ge |\mathrm{In}(G)| + |\mathrm{Out}(G)|$
$T_1^{(L)}(p, G, \mathcal{P}) \ge |V| - |\mathrm{In}(G)|$
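
Both bounds of the lemma depend only on the sizes of the vertex sets, so they are easy to evaluate for a given DAG. The Python sketch below assumes the graph is supplied as a dictionary of predecessor lists and, as in the text, that every vertex must be pebbled. The function name is an illustrative assumption.

```python
from typing import Dict, Hashable, List, Sequence, Tuple


def lemma_bounds(preds: Dict[Hashable, List[Hashable]],
                 outputs: Sequence[Hashable]) -> Tuple[int, int]:
    """Return (|In(G)| + |Out(G)|, |V| - |In(G)|).

    The first value bounds the I/O time at each level l, 2 <= l <= L;
    the second bounds the computation time.
    """
    inputs = [v for v, p in preds.items() if not p]
    io_bound = len(inputs) + len(set(outputs))
    computation_bound = len(preds) - len(inputs)
    return io_bound, computation_bound


if __name__ == "__main__":
    # A small DAG: inputs a and b feed c and d, which feed the single output e.
    dag = {"a": [], "b": [], "c": ["a", "b"], "d": ["a", "b"], "e": ["c", "d"]}
    print(lemma_bounds(dag, ["e"]))  # (3, 3)
```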

The following theorem relates the number of moves in an L-level game to the number in 
a two-level game and allows us to use prior results. The lower bound on the level-$l$ I/O time
is stated in terms of $s_{l-1}$ because pebbles at levels $1, 2, \ldots, l - 1$ are treated collectively as red
pebbles to derive a lower bound; pebbles at level $l$ and above are treated as blue pebbles.

THEOREM 11.3.1 Let $s_l = \sum_{j=1}^{l} p_j$. Then the following inequalities hold for every L-level
standard MHG pebbling strategy $\mathcal{P}$ for G, where $p$ is the resource vector used by $\mathcal{P}$ and $T_1^{(2)}(S, G)$
and $T_2^{(2)}(S, G)$ are the number of computation and I/O operations used by a minimal pebbling in
the red-blue pebble game played on G with S red pebbles:

$T_l^{(L)}(p, G, \mathcal{P}) \ge T_2^{(2)}(s_{l-1}, G)$ for $1 < l \le L$

Also, the following lower bound on computation time holds for all pebbling strategies $\mathcal{P}$ in the
standard MHG:

$T_1^{(L)}(p, G, \mathcal{P}) \ge T_1^{(2)}(s_1, G)$

In the I/O-limited case the following lower bounds apply, where $\alpha$ is the maximum fan-in of any
vertex of G:

$T_l^{(L)}(p, G, \mathcal{P}) \ge T_2^{(2)}(s_{l-1}, G)$ for $2 \le l \le L$

$T_1^{(L)}(p, G, \mathcal{P}) \ge T_2^{(2)}(s_{L-1}, G)/\alpha$

Proof The first set of inequalities is shown by considering the red-blue game played with
$S = s_{l-1}$ red pebbles and an unlimited number of blue pebbles. The S red pebbles and
$s_{L-1} - S$ blue pebbles can be classified into $L - 1$ groups with $p_j$ pebbles in the jth
group, so that we can simulate the steps of an L-level MHG pebbling strategy $\mathcal{P}$. Because
there are constraints on the use of pebbles in $\mathcal{P}$, this strategy uses a number of level-$l$ I/O
operations that cannot be smaller than the minimum number of such I/O operations when
pebbles at level $l - 1$ or less are treated as red pebbles and those at higher levels are treated
as blue pebbles. Thus, $T_l^{(L)}(p, G, \mathcal{P}) \ge T_2^{(2)}(s_{l-1}, G)$. By similar reasoning it follows that
$T_1^{(L)}(p, G, \mathcal{P}) \ge T_1^{(2)}(s_1, G)$.

In the above simulation, blue pebbles simulating levels $l$ and above cannot be used arbitrarily
when the I/O-limitation is imposed. To derive lower bounds under this limitation, we
classify $S = s_{L-1}$ pebbles into $L - 1$ groups with $p_j$ pebbles in the jth group and simulate
in the red-blue pebble game the steps of an L-level I/O-limited MHG pebbling strategy $\mathcal{P}$.
The I/O time at level $l$ is no less than the I/O time in the two-level I/O-limited red-blue
pebble game in which all S red pebbles are used at level $l - 1$ or less.

Since the number of blue pebbles is unlimited, in a minimal pebbling all I/O operations
consist of placing red pebbles on blue-pebbled vertices. It follows that if T I/O operations
are performed on the input vertices, then at least T placements of red pebbles on blue-
pebbled vertices occur. Since at least one internal vertex must be pebbled with a red pebble
in a minimal pebbling for every $\alpha$ input vertices that are red-pebbled, the computation time
is at least $T/\alpha$. Specializing this to $T = T_2^{(2)}(s_{L-1}, G)$ for the I/O-limited MHG, we have
the last result. ■
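
The way the theorem converts a two-level bound into one bound per level can be sketched in a few lines of Python. The sketch computes the prefix sums $s_l$ of the resource vector and evaluates a caller-supplied two-level I/O bound at $s_{l-1}$ for each level $l = 2, \ldots, L$. The function name is an assumption, and the bound function passed in the example is a hypothetical placeholder rather than a result from the text.

```python
from itertools import accumulate
from typing import Callable, Dict, Sequence


def level_io_bounds(p: Sequence[int],
                    two_level_io_bound: Callable[[int], float]) -> Dict[int, float]:
    """Map each level l (2 <= l <= L) to two_level_io_bound(s_{l-1}).

    p = (p_1, ..., p_{L-1}) is the resource vector, so L = len(p) + 1.
    """
    prefix = list(accumulate(p))          # prefix[k - 1] holds s_k = p_1 + ... + p_k
    top_level = len(p) + 1
    return {l: two_level_io_bound(prefix[l - 2]) for l in range(2, top_level + 1)}


if __name__ == "__main__":
    # Resource vector p = (2, 4) of a three-level game: s_1 = 2 and s_2 = 6.
    # The lambda stands in for a real two-level bound; it is purely illustrative.
    print(level_io_bounds((2, 4), lambda S: 1000 // S))  # {2: 500, 3: 166}
```

Because $s_{l-1}$ grows with $l$, and a two-level bound typically decreases as the number of red pebbles grows, the bounds obtained this way are weaker at higher levels.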

It is important to note that the lower bound to $T_1^{(2)}(S, G, \mathcal{P})$ for the I/O-limited case is
not stated in terms of $|V|$, because $|V|$ may not be the same for each value of S. Consider the
multiplication of two $n \times n$ matrices. Every graph of the standard algorithm can be pebbled
with three red pebbles, but such graphs have about $2n^3$ vertices, a number that cannot be
reduced by more than a constant factor when a constant number of red pebbles is used. (See
Section 11.5.2.) On the other hand, using the graph of Strassen's algorithm for this problem
requires at least $\Omega(n^{0.38529})$ pebbles, since it has $O(n^{2.807})$ vertices.

We close this section by giving conditions under which lower bounds for one graph can 
be used for another. Let a reduction of DAG $G_1 = (V_1, E_1)$ be a DAG $G_0 = (V_0, E_0)$,
$V_0 \subseteq V_1$ and $E_0 \subseteq E_1$, obtained by deleting edges from $E_1$ and coalescing the non-terminal
vertices on a "chain" of vertices in $V_1$ into the first vertex on the chain. A chain is a sequence
$v_1, v_2, \ldots, v_r$ of vertices such that, for $2 \le i \le r - 1$, $v_i$ is adjacent to $v_{i-1}$ and $v_{i+1}$ and no
other vertices.

LEMMA 11.3.2 Let $G_0$ be a reduction of $G_1$. Then for any minimal pebbling $\mathcal{P}_{\min}$ and
$1 \le l \le L$, the following inequalities hold:

$T_l^{(L)}(p, G_1, \mathcal{P}_{\min}) \ge T_l^{(L)}(p, G_0, \mathcal{P}_{\min})$
Proof Any minimal pebbling strategy for $G_1$ can be used to pebble $G_0$ by simulating moves
on a chain with pebble placements on the vertex to which vertices on the chain are coalesced
and by honoring the edge restrictions of $G_1$ that are removed to create $G_0$. Since this strategy
for $G_1$ may not be minimal for $G_0$, the inequalities follow. ■

11.4 The Hong-Kung Lower-Bound Method 

In this section we derive lower limits on the I/O time at each level of a memory hierarchy 
needed to pebble a directed acyclic graph with the MHG. These results are obtained by com- 
bining the inequalities of Theorem 11.3.1 with a lower bound on the I/O and computation 
time for the red-blue pebble game. 

Theorem 10.4.1 provides a framework that can be used to derive lower bounds on the I/O 
time in the red-blue pebble game. This follows because the lower bounds of Theorem 10.4.1 
are stated in terms of 7j, the number of times inputs are pebbled with S red pebbles, which 
is also the number of I/O operations on input vertices in the red-blue pebble game. It is 
important to note that the lower bounds derived using this framework apply to every straight- 
line program for a problem. 

In some cases, for example matrix multiplication, these lower bounds are strong. However, 
in other cases, notably the discrete Fourier transform, they are weak. For this reason we intro- 
duce a way to derive lower bounds that applies to a particular graph of a problem. If that graph 
is used for the problem, stronger lower bounds can be derived with this method than with the 
techniques of Chapter 10. We begin by introducing the S-span of a DAG.

DEFINITION 11.4.1 Given a DAG G = (V, E), the S-span of G, $\rho(S, G)$, is the maximum
number of vertices of G that can be pebbled with S pebbles in the red pebble game maximized over
all initial placements of S red pebbles. (The initialization rule is disallowed.)

The following is a slightly weaker but simpler version of the Hong-Kung [137] lower 
bound on I/O time for the two-level MHG. This proof divides computation time into con- 
secutive intervals, just as was done for the space-time lower bounds in the proofs of Theo- 
rems 10.4.1 and 10.11.1. 

THEOREM 11.4.1 For every pebbling $\mathcal{P}$ of the DAG G = (V, E) in the red-blue pebble game
with S red pebbles, the I/O time used, $T_2^{(2)}(S, G, \mathcal{P})$, satisfies the following lower bound:

$\lceil T_2^{(2)}(S, G, \mathcal{P})/S \rceil \, \rho(2S, G) \ge |V| - |\mathrm{In}(G)|$

Proof Divide $\mathcal{P}$ into consecutive sequential sub-pebblings $(\mathcal{P}_1, \mathcal{P}_2, \ldots, \mathcal{P}_h)$, where each
sub-pebbling has S I/O operations except possibly the last, which has no more such operations.
Thus, $h = \lceil T_2^{(2)}(S, G, \mathcal{P})/S \rceil$.

We now develop an upper bound Q to the number of vertices of G pebbled with red
pebbles in any sub-pebbling $\mathcal{P}_j$. This number multiplied by the number h of sub-pebblings
is an upper bound to the number of vertices other than inputs, $|V| - |\mathrm{In}(G)|$, that must be
pebbled to pebble G. It follows that

$Qh \ge |V| - |\mathrm{In}(G)|$
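
Once the span $\rho(2S, G)$, or any upper bound on it, is known, the inequality of Theorem 11.4.1 can be solved for the I/O time: the number of sub-pebblings must be at least $\lceil (|V| - |\mathrm{In}(G)|)/\rho(2S, G) \rceil$, and all but the last contain S I/O operations. The Python sketch below carries out that arithmetic. The function name is an assumption and the numbers in the example are made up for illustration.

```python
def hong_kung_io_lower_bound(S: int, span_2S: int,
                             num_vertices: int, num_inputs: int) -> int:
    """Smallest integer T with ceil(T / S) * span_2S >= num_vertices - num_inputs.

    span_2S is rho(2S, G) or an upper bound on it.
    """
    if S < 1 or span_2S < 1:
        raise ValueError("S and span_2S must be positive")
    need = num_vertices - num_inputs
    if need <= 0:
        return 0
    h_min = -(-need // span_2S)        # integer ceiling of need / span_2S
    # ceil(T / S) >= h_min holds exactly when T >= S * (h_min - 1) + 1.
    return S * (h_min - 1) + 1


if __name__ == "__main__":
    # Hypothetical values: S = 3 red pebbles, span at most 8,
    # a graph with 32 vertices of which 8 are inputs.
    print(hong_kung_io_lower_bound(3, 8, 32, 8))  # 7
```

With these values at least three sub-pebblings are needed, so at least seven I/O operations occur. Six would fit into two sub-pebblings, which can account for at most 16 of the 24 non-input vertices.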

The upper bound on Q is developed by adding S new red pebbles and showing that
we may use these new pebbles to move all I/O operations in a sub-pebbling $\mathcal{P}_t$ to either
the beginning or the end of the sub-pebbling without changing the number of computation
steps or I/O operations. Thus, without changing them, we move all computation steps to a
middle interval of $\mathcal{P}_t$, between the higher-level I/O operations.

We now show how this may be done. Consider a vertex v carrying a red pebble at some
time during $\mathcal{P}_t$ that is pebbled for the first time with a blue pebble during $\mathcal{P}_t$ (vertex 7 at
step 11 in Fig. 11.3). Instead of pebbling v with a blue pebble, use a new red pebble to
keep a red pebble on v. (This is equivalent to swapping the new and old red pebbles on v.)
This frees up the original red pebble to be used later in the sub-pebbling. Because we attach
a red pebble to v for the entire pebbling $\mathcal{P}_t$, all later output operations from v in $\mathcal{P}_t$ can
be deleted except for the last such operation, if any, which can be moved to the end of the
interval. Note that if, after v is given a blue pebble in $\mathcal{P}_t$, it is later given a red pebble, this red
pebbling step and all subsequent blue pebbling steps except the last, if any, can be deleted.
These changes do not affect any computation step in $\mathcal{P}_t$.

Consider a vertex v carrying a blue pebble at the start of $\mathcal{P}_t$ that later in $\mathcal{P}_t$ is given a
red pebble (see vertex 4 at step 12 in Fig. 11.3). Consider the first pebbling of this kind.
The red pebble assigned to v may have been in use prior to its placement on v. If a new
red pebble is used for v, the first pebbling of v with a red pebble can be moved toward
the beginning of $\mathcal{P}_t$ so that, without violating the precedence conditions of G, it precedes
all placements of red pebbles on vertices without pebbles. Attach this new red pebble to v
during $\mathcal{P}_t$. Subsequent placements of red pebbles on v when it carries a blue pebble during
$\mathcal{P}_t$, if any, are thereby eliminated.

Figure 11.3 The vertices of an FFT graph are numbered and a pebbling schedule is given in
which the two numbered red pebbles are used. Up (down) arrows identify steps in which an
output (input) occurs; other steps are computation steps. Steps 10 through 13 of the schedule $\mathcal{P}_t$
contain two I/O operations. With two new red pebbles, the input at step 12 can be moved to the
beginning of the interval and the output at step 11 can be moved after step 13.

We now derive an upper bound to Q. At the start of the pebbling of the middle interval
of $\mathcal{P}_t$ there are at most 2S red pebbles on G, at most S original red pebbles plus S new red
pebbles. Clearly, the number of vertices that can be pebbled in the middle interval with first-level
pebbles is largest when all 2S red pebbles on G are allowed to move freely. It follows
that at most p(2S, G) vertices can be pebbled with red pebbles in any interval. Since all
vertices must be pebbled with red pebbles, this completes the proof. ■
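
As an illustration only, the inequality of Theorem 11.4.1 can be turned into a numeric lower bound on I/O time. The sketch below is not from the text: the function name `io_lower_bound` and its parameters are hypothetical, and the caller is assumed to supply a valid upper bound on the 2S-span p(2S, G).

```python
def io_lower_bound(num_vertices, num_inputs, S, span_2S):
    """Smallest T satisfying ceil(T / S) * span_2S >= num_vertices - num_inputs.

    span_2S is assumed to be an upper bound on the 2S-span p(2S, G).
    """
    work = num_vertices - num_inputs      # |V| - |In(G)|
    h = -(-work // span_2S)               # ceil(work / span_2S): sub-pebblings needed
    # ceil(T / S) >= h forces T > S * (h - 1) whenever h >= 1.
    return S * (h - 1) + 1 if h >= 1 else 0
```

For example, with 80 non-input vertices, S = 4, and a span bound of 10, at least eight sub-pebblings are needed, so more than 28 I/O operations must occur.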

Combining Theorems 11.3.1 and 11.4.1 and a weak lower limit on the size of $T_l^{(L)}(p, G)$,
we have the following explicit lower bounds to $T_l^{(L)}(p, G)$.

COROLLARY 11.4.1 In the standard MHG when $T_l^{(L)}(p, G) \ge \beta(s_{l-1} - 1)$ for $\beta > 1$, the
following inequality holds for $2 \le l \le L$:

$$T_l^{(L)}(p, G) \ge \frac{\beta}{\beta + 1}\,\frac{s_{l-1}}{p(2s_{l-1}, G)}\,(|V| - |In(G)|)$$

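In the I/O-limited MHG when $T_l^{(L)}(p, G) \ge \beta(s_{L-1} - 1)$ for $\beta > 1$, the following inequality
holds for $2 \le l \le L$:

$$T_l^{(L)}(p, G) \ge \frac{\beta}{\beta + 1}\,\frac{s_{L-1}}{p(2s_{L-1}, G)}\,(|V| - |In(G)|)$$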

11.5 Tradeoffs Between Space and I/O Time 

We now apply the Hong-Kung method to a variety of important problems including matrix- 
vector multiplication, matrix-matrix multiplication, the fast Fourier transform, convolution, 
and merging and permutation networks. 

11.5.1 Matrix-Vector Product

We examine here the matrix-vector product function $f_{Ax}^{(n)} : \mathcal{R}^{n^2+n} \to \mathcal{R}^n$ over a commutative
ring $\mathcal{R}$ described in Section 6.2.1 primarily to illustrate the development of efficient multi-level
pebbling strategies. The lower bounds on I/O and computation time for this problem
are trivial to obtain. For the matrix-vector product, we assume that the graphs used are those
associated with inner products. The inner product u · v of n-vectors u and v over a ring $\mathcal{R}$
is defined by:

$$u \cdot v = \sum_{i=1}^{n} u_i v_i$$

The graph of a straight-line program to compute this inner product is given in Fig. 11.4, where
the additions of products are formed from left to right.

The matrix-vector product is defined here as the pebbling of a collection of inner product
graphs. As suggested in Fig. 11.4, each inner product graph can be pebbled with three red
pebbles.

THEOREM 11.5.1 Let G be the graph of a straight-line program for the product of the matrix A
with the vector x. Let G be pebbled in the standard MHG with the resource vector p. There is a
pebbling strategy $\mathcal{P}$ of G with $p_l \ge 1$ for $2 \le l \le L - 1$ and $p_1 \ge 3$ such that
$T_1^{(L)}(p, G, \mathcal{P}) = 2n^2 - n$, the minimum value, and the following bounds hold simultaneously:

$$n^2 + 2n \le T_l^{(L)}(p, G, \mathcal{P}) \le 2n^2 + n$$

Figure 11.4 The graph of an inner product computation showing the order in which vertices
are pebbled. Input vertices are labeled with the entries in the matrix A and vector x that are
combined. Open vertices are product vertices; those above them are addition vertices.

Proof The lower bound $T_l^{(L)}(p, G, \mathcal{P}) \ge n^2 + 2n$, $1 < l \le L$, follows from Lemma 11.3.1
because there are $n^2 + n$ inputs and n outputs to the matrix-vector product. The upper
bounds derived below represent the number of operations performed by a pebbling strategy
that uses three level-1 pebbles and one pebble at each of the other levels.

Each of the n results of the matrix-vector product is computed as an inner product in
which successive products $a_{i,j} x_j$ are formed and added to a running sum, as suggested by
Fig. 11.4. Each of the $n^2$ entries of the matrix A (leaves of inner product trees) is used in
one inner product and is pebbled once at levels L, L - 1, ..., 1 when needed. The n entries
in x are used in every inner product and are pebbled once at each level for each of the n
inner products. First-level pebbles are placed on each vertex of each inner product tree in the
order suggested in Fig. 11.4. After the root vertex of each tree is pebbled with a first-level
pebble, it is pebbled at levels 2, ..., L.

It follows that one I/O operation is performed at each level on each vertex associated
with an entry in A and the outputs and that n I/O operations are performed at each level
on each vertex associated with an entry in x, for a total of $2n^2 + n$ I/O operations at each
level. This pebbling strategy places a first-level pebble once on each interior vertex of each
of the n inner product trees. Such trees have 2n - 1 internal vertices. Thus, this strategy
takes $2n^2 - n$ computation steps. ■
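
The counts in this proof can be checked by tallying the operations one inner product at a time. The following is a minimal sketch, not part of the text; the function name `matvec_pebbling_counts` is hypothetical, and it only counts operations rather than simulating pebbles.

```python
def matvec_pebbling_counts(n):
    """Tally per-level I/O operations and computation steps for the
    matrix-vector strategy described in the proof of Theorem 11.5.1."""
    io_per_level = 0
    computation_steps = 0
    for _ in range(n):                # one inner product per output entry
        io_per_level += n             # a row of A: each entry is brought in once
        io_per_level += n             # all n entries of x are brought in again
        computation_steps += n        # n product vertices
        computation_steps += n - 1    # n - 1 addition vertices
        io_per_level += 1             # the root (output) vertex is written out
    return io_per_level, computation_steps
```

For any n the tally agrees with the closed forms $2n^2 + n$ and $2n^2 - n$ stated above.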

As the above results demonstrate, the matrix-vector product is an example of an I/O- 
bounded problem, a problem for which the amount of I/O required at each level in the 
memory hierarchy is comparable to the number of computation steps. Returning to the dis- 
cussion in Section 11.1.2, we see that as CPU speed increases with technological advances, a 
balanced computer system can be constructed for this problem only if the I/O speed increases 
proportionally to CPU speed. 

The I/O-limited version of the MHG for the matrix-vector product is the same as the
standard version because only first-level pebbles are used on vertices that are neither input nor
output vertices.




11.5.2 Matrix-Matrix Multiplication 

In this section we derive upper and lower bounds on exchanges between I/O time and space 
for the n x n matrix multiplication problem in the standard and I/O-limited MHG. We show 
that the lower bounds on computation and I/O time can be matched by efficient pebbling 
strategies. 

Lower bounds for the standard MHG are derived for the family $T_n$ of inner product
graphs for n x n matrix multiplication, namely, the set of graphs to multiply two n x n matrices
using just inner products to compute entries in the product matrix. (See Section 6.2.2.)
We allow the additions in these inner products to be performed in any order.

The lower bounds on I/O time derived below for the I/O-limited MHG apply to all DAGs
for matrix multiplication. Since these DAGs include graphs other than the inner product trees
in $T_n$, one might expect the lower bounds for the I/O-limited case to be smaller than those
derived for graphs in $T_n$. However, this is not the case, apparently because efficient pebbling
strategies for matrix multiplication perform I/O operations only on input and output vertices,
not on internal vertices. The situation is very different for the discrete Fourier transform, as
seen in the next section.

We derive results first for the red-blue pebble game, that is, the two-level MHG, and then
generalize them to the multi-level MHG. We begin by deriving an upper bound on the S-span
for the family of inner product matrix multiplication graphs.

LEMMA 11.5.1 For every graph $G \in T_n$ the S-span p(S, G) satisfies the bound
$p(S, G) \le 2S^{3/2}$ for $S \le n^2$.

Proof p(S, G) is the maximum number of vertices of $G \in T_n$ that can be pebbled with
S red pebbles from an initial placement of these pebbles, maximized over all such initial
placements. Let A, B, and C be n x n matrices with entries $\{a_{i,j}\}$, $\{b_{i,j}\}$, and $\{c_{i,j}\}$,
respectively, where $1 \le i, j \le n$. Let C = A x B. The term $c_{i,j} = \sum_k a_{i,k} b_{k,j}$ is
associated with the root vertex of a unique inner product tree. Vertices in this tree are
either addition vertices, product vertices associated with terms of the form $a_{i,k} b_{k,j}$, or input
vertices associated with entries in the matrices A and B. Each product term $a_{i,k} b_{k,j}$ is
associated with a unique term $c_{i,j}$ and tree, as is each addition operator.

Consider an initial placement of $S \le n^2$ pebbles of which r are in addition trees (they
are on addition or product vertices). Let the remaining S - r pebbles reside on input
vertices. Let p be the number of product vertices that can be pebbled from these pebbled
inputs. We show that at most p + r - 1 additional pebble placements are possible from the
initial placement, giving a total of at most $\pi = 2p + r - 1$ pebble placements. (Figure 11.5
shows a graph G for a 2 x 2 matrix multiplication algorithm in which the product vertices
are those just below the output vertices. The black vertices carry pebbles. In this example
r = 2 and p = 1. While p + r - 1 = 2, only one pebble placement is possible on addition
trees in this example.)

Figure 11.5 Graph of the inner products used to form the product of two 2 x 2 matrices,
shown in panels (a) through (d). (Common input vertices are repeated for clarity.)

Given the dependencies of graphs in $T_n$, there is no loss in generality in assuming that
product vertices are pebbled before pebbles are advanced in addition trees. It follows that at
most p + r addition-tree vertices carry pebbles before pebbles are advanced in addition trees.
These pebbled vertices define subtrees of vertices that can be pebbled from the p + r initial
pebble placements. Since a binary tree with n leaves has n - 1 non-leaf nodes, it follows
that if there are t such trees, at most p + r - t pebble placements will be made, not counting
the original placement of pebbles. This number is maximized at t = 1. (See Problem 11.9.)

We now complete the proof by deriving an upper bound on p. Let $\hat{A}$ be the 0-1 n x n
matrix whose (i, j) entry is 1 if the variable in the (i, j) position of the matrix A carries a
pebble initially and 0 otherwise. Let $\hat{B}$ be similarly defined for B. It follows that the (i, j)
entry, $\delta_{i,j}$, of the matrix product $\hat{C} = \hat{A} \times \hat{B}$, where addition and multiplication are over
the integers, is equal to the number of products that can be formed that contribute to the
(i, j) entry of the result matrix C. Thus $p = \sum_{i,j} \delta_{i,j}$. We now show that $p \le \sqrt{S}\,(S - r)$.

Let $\hat{A}$ and $\hat{B}$ have a and b 1s, respectively, where a + b = S - r. There are at most $a/\alpha$
rows of $\hat{A}$ containing at least $\alpha$ 1s. The maximum number of products that can be formed
from such rows is $ab/\alpha$ because each 1 in $\hat{B}$ can combine with a 1 in each of these rows. Now
consider the product of other rows of $\hat{A}$ with columns of $\hat{B}$. At most S such row-column
inner products are formed since at most S outputs can be pebbled. Since each of them
involves a row with at most $\alpha$ 1s, at most $\alpha S$ products of pairs of variables can be formed.
Thus, a total of at most $p = ab/\alpha + \alpha S$ products can be formed. We are free to choose
$\alpha$ to minimize this sum ($\alpha = \sqrt{ab/S}$ does this) but must choose a and b to maximize it
($a = (S - r)/2$ satisfies this requirement). The result is that $p \le \sqrt{S}\,(S - r)$. We complete
the proof by observing that $\pi = 2p + r - 1 \le 2S\sqrt{S}$ for $r \ge 0$. ■
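
The optimization at the end of this proof can be spelled out as follows. This is a worked restatement using the proof's own quantities a, b, $\alpha$, S, and r; the appeal to the arithmetic-geometric mean inequality (for ab > 0) is our choice of justification and is not stated in the text.

```latex
\begin{align*}
\min_{\alpha > 0}\left(\frac{ab}{\alpha} + \alpha S\right)
  &= 2\sqrt{abS} \quad\text{at } \alpha = \sqrt{ab/S},\\
\max_{a + b = S - r} 2\sqrt{abS}
  &= 2\sqrt{S}\cdot\frac{S - r}{2} = \sqrt{S}\,(S - r)
  \quad\text{at } a = b = \frac{S - r}{2},\\
\pi = 2p + r - 1
  &\le 2\sqrt{S}\,(S - r) + r - 1
   = 2S\sqrt{S} - r\,(2\sqrt{S} - 1) - 1 \le 2S\sqrt{S}.
\end{align*}
```

The last step uses $S \ge 1$, so that $2\sqrt{S} - 1 > 0$ and the term in r can only decrease the bound.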

Theorem 11.5.2 states bounds that apply to the computation and I/O time in the red-blue
pebble game for matrix multiplication.

THEOREM 11.5.2 For every graph G in the family $T_n$ of inner product graphs for multiplying
two n x n matrices and for every pebbling strategy $\mathcal{P}$ for G in the red-blue pebble game that
uses $S \ge 3$ red pebbles, the computation and I/O time satisfy the following lower bounds:

$$T_1^{(2)}(S, G, \mathcal{P}) = \Omega(n^3)$$
$$T_2^{(2)}(S, G, \mathcal{P}) = \Omega\left(\frac{n^3}{\sqrt{S}}\right)$$

Furthermore, there is a pebbling strategy $\mathcal{P}$ for G with $S \ge 3$ red pebbles such that the following
upper bounds hold simultaneously:

$$T_1^{(2)}(S, G, \mathcal{P}) = O(n^3)$$
$$T_2^{(2)}(S, G, \mathcal{P}) = O\left(\frac{n^3}{\sqrt{S}}\right)$$

The lower bound on I/O time stated above applies for every graph of a straight-line program for
matrix multiplication in the I/O-limited red-blue pebble game. The upper bound on I/O time
also applies for this game. The computation time in the I/O-limited red-blue pebble game satisfies
the following bound:

$$T_1^{(2)}(S, G, \mathcal{P}) = \Omega\left(\frac{n^3}{\sqrt{S}}\right)$$

Proof For the standard MHG, the lower bound to $T_1^{(2)}(S, G, \mathcal{P})$ follows from the fact that
every graph in $T_n$ has $\Theta(n^3)$ vertices and Lemma 11.3.1. The lower bound to $T_2^{(2)}(S, G)$
follows from Corollary 11.4.1 and Lemma 11.5.1 and the lower bound to $T_2^{(2)}(S, G, \mathcal{P})$
for the I/O-limited MHG follows from Theorem 11.3.1.
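
To see where the $\sqrt{S}$ in the I/O bound comes from, one can substitute Lemma 11.5.1 directly into Theorem 11.4.1. This is a sketch under two assumptions: that $2S \le n^2$, so the lemma applies to 2S pebbles, and that G consists of $n^2$ inner product trees, each with n product vertices and n - 1 addition vertices. The constants are illustrative and are not claimed to match those obtained via Corollary 11.4.1.

```latex
\begin{align*}
|V| - |In(G)| &= n^2(2n - 1), \qquad
p(2S, G) \le 2(2S)^{3/2} = 4\sqrt{2}\,S^{3/2},\\
\left\lceil \frac{T_2^{(2)}(S, G, \mathcal{P})}{S} \right\rceil
  &\ge \frac{n^2(2n - 1)}{4\sqrt{2}\,S^{3/2}},\\
T_2^{(2)}(S, G, \mathcal{P})
  &\ge S\left(\left\lceil \frac{T_2^{(2)}(S, G, \mathcal{P})}{S} \right\rceil - 1\right)
   \ge \frac{n^2(2n - 1)}{4\sqrt{2}\,\sqrt{S}} - S .
\end{align*}
```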

We now describe a pebbling strategy that has the I/O time stated above and uses the
obvious algorithm suggested by Fig. 11.6. If S red pebbles are available, let $r = \lfloor \sqrt{S/3} \rfloor$ be
an integer that divides n. (If r does not divide n, embed A, B and C in larger matrices for
which r does divide n. This requires at most doubling n.) Let the n x n matrices A, B and
C be partitioned into n/r x n/r matrices; that is, $A = [a_{i,j}]$, $B = [b_{i,j}]$, and $C = [c_{i,j}]$,
whose entries are r x r matrices. We form the r x r submatrix $c_{i,j}$ of C as the inner product
of a row of r x r submatrices of A with a column of such submatrices of B:

$$c_{i,j} = \sum_{q=1}^{n/r} a_{i,q} \times b_{q,j}$$

We begin by placing blue pebbles on each entry in matrices A and B. Compute $c_{i,j}$ by
computing $a_{i,q} \times b_{q,j}$ for q = 1, 2, ..., n/r and adding successive products to the running
sum. Keep $r^2$ red pebbles on the running sum. Compute $a_{i,q} \times b_{q,j}$ by placing and holding
$r^2$ red pebbles on the entries in $a_{i,q}$ and r red pebbles on one column of $b_{q,j}$ at a time. Use
two additional red pebbles to compute the $r^2$ inner products associated with entries of $c_{i,j}$
in the fashion suggested by Fig. 11.4 if $r \ge 2$ and one additional pebble if r = 1. The
maximum number of red pebbles in use is 3 if r = 1 and at most $2r^2 + r + 2$ if $r \ge 2$.
Since $2r^2 + r + 2 \le 3r^2$ for $r \ge 2$, in both cases at most $3r^2$ red pebbles are needed. Thus,
there are enough red pebbles to play this game because $r = \lfloor \sqrt{S/3} \rfloor$ implies that $3r^2 \le S$,
the number of red pebbles. Since $r \ge 1$, this requires that $S \ge 3$.



Figure 11.6 A pebbling schema for matrix multiplication, C = A x B, based on the representation
of a matrix in terms of block submatrices. A submatrix of C is computed as the inner product of a
row of blocks of A with a column of blocks of B.

This algorithm performs one input operation on each entry of $a_{i,q}$ and $b_{q,j}$ to compute
$c_{i,j}$. It also performs one output operation per entry to compute $c_{i,j}$ itself. Summing over
all values of i and j, we find that $n^2$ output operations are performed on entries in C. Since
there are $(n/r)^2$ submatrices $a_{i,q}$ and $b_{q,j}$ and each is used to compute n/r terms $c_{u,v}$, the
number of input operations on entries in A and B is $2(n/r)^2 r^2 (n/r) = 2n^3/r$. Because
$r = \lfloor \sqrt{S/3} \rfloor$, we have $r \ge \sqrt{S/3} - 1$, from which the upper bound on the number of
I/O operations follows. Since each product and addition vertex in each inner product graph
is pebbled once, $O(n^3)$ computation steps are performed.
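
The I/O count of the blocked strategy can be reproduced with a short program. The sketch below is an illustration, not the author's code: the function name `blocked_matmul_io` is hypothetical, fast memory is modeled only implicitly by the r x r running sum, and it assumes r divides n and that every block of A and B is read in once per block product.

```python
def blocked_matmul_io(A, B, r):
    """Multiply n x n matrices by r x r blocks and count block transfers.

    Returns (C, input_ops, output_ops); assumes r divides n.
    """
    n = len(A)
    assert n % r == 0, "embed the matrices in larger ones so that r divides n"
    C = [[0] * n for _ in range(n)]
    input_ops = output_ops = 0
    for bi in range(0, n, r):
        for bj in range(0, n, r):
            acc = [[0] * r for _ in range(r)]   # running sum kept in fast memory
            for bq in range(0, n, r):
                input_ops += 2 * r * r          # read one block of A and one of B
                for i in range(r):
                    for j in range(r):
                        for q in range(r):
                            acc[i][j] += A[bi + i][bq + q] * B[bq + q][bj + j]
            for i in range(r):
                for j in range(r):
                    C[bi + i][bj + j] = acc[i][j]
            output_ops += r * r                 # write the finished block of C
    return C, input_ops, output_ops
```

With these assumptions the counters come out to $2n^3/r$ input operations and $n^2$ output operations, matching the analysis above.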

The bound on $T_2^{(2)}(S, G, \mathcal{P})$ for the I/O-limited game follows from two observations.
First, the computational inequality of Theorem 10.4.1 provides a lower bound to $T_I$, the
number of times that input vertices are pebbled in the red-pebble game when only red
pebbles are used on vertices. This is the I/O-limited model. Second, the lower bound of
Theorem 10.5.4 on T (actually, $T_I$) is of the form desired. ■

These results and the strategy given for the two-level case carry over to the multi-level case,
although considerable care is needed to ensure that the pebbling strategy does not fragment
memory and lead to inefficient upper bounds.

Even though the pebbling strategy given below is an I/O-limited strategy, it provides 
bounds on time in terms of space that match the lower bounds for the standard MHG. 

THEOREM 11.5.3 For every graph G in the family $T_n$ of inner product graphs for multiplying
two n x n matrices and for every pebbling strategy $\mathcal{P}$ for G in the standard MHG with resource
vector p that uses $p_1 \ge 3$ first-level pebbles, the computation and I/O time satisfy the following
lower bounds, where $s_l = \sum_{j=1}^{l} p_j$ and k is the largest integer such that $s_k \le 3n^2$:

$$T_1^{(L)}(p, G, \mathcal{P}) = \Omega(n^3)$$
$$T_l^{(L)}(p, G, \mathcal{P}) = \begin{cases} \Omega\left(n^3/\sqrt{s_{l-1}}\right) & \text{for } 2 \le l \le k \\ \Omega(n^2) & \text{for } k + 1 \le l \le L \end{cases}$$

Furthermore, there is a pebbling strategy $\mathcal{P}$ for G with $p_1 \ge 3$ such that the following upper bounds
hold simultaneously:

$$T_1^{(L)}(p, G, \mathcal{P}) = O(n^3)$$
$$T_l^{(L)}(p, G, \mathcal{P}) = \begin{cases} O\left(n^3/\sqrt{s_{l-1}}\right) & \text{for } 2 \le l \le k \\ O(n^2) & \text{for } k + 1 \le l \le L \end{cases}$$

In the I/O-limited MHG the upper bounds given above apply. The following lower bound on the
I/O time applies to every graph G for n x n matrix multiplication and every pebbling strategy $\mathcal{P}$,
where $S = s_{L-1}$:

$$T_l^{(L)}(p, G, \mathcal{P}) = \Omega\left(n^3/\sqrt{S}\right) \quad \text{for } 1 \le l \le L$$

Proof The lower bounds on $T_l^{(L)}(p, G, \mathcal{P})$, $2 \le l \le L$, follow from Theorems 11.3.1 and
11.5.2. The lower bound on $T_1^{(L)}(p, G, \mathcal{P})$ follows from the fact that every graph in $T_n$
has $\Theta(n^3)$ vertices to be pebbled.

Figure 11.7 A three-level decomposition of a matrix.



We now describe a multi-level recursive pebbling strategy satisfying the upper bounds
given above. It is based on the two-level strategy given in the proof of Theorem 11.5.2. We
compute C from A and B using inner products.

Our approach is to successively block A, B, and C into $r_i \times r_i$ submatrices for
i = k, k - 1, ..., 1, where the $r_i$ are chosen, as suggested in Fig. 11.7, so that they divide one another
and avoid memory fragmentation. They are also chosen relative to $s_i$ so that enough
pebbles are available to pebble $r_i \times r_i$ submatrices, as explained below.



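$$r_i = \begin{cases} \left\lfloor \sqrt{s_1/3} \right\rfloor & i = 1 \\ r_{i-1} \left\lfloor \sqrt{s_i - i + 1} \,/\, (\sqrt{3}\, r_{i-1}) \right\rfloor & i \ge 2 \end{cases}$$

Using the fact that $b/2 \le a\lfloor b/a \rfloor \le b$ for integers a and b satisfying $1 \le a \le b$ (see
Problem 11.1), we see that $\sqrt{(s_i - i + 1)/12} \le r_i \le \sqrt{(s_i - i + 1)/3}$. Thus,
$s_i \ge 3r_i^2 + i - 1$. Also, $r_k^2 \le n^2$ because $s_k \le 3n^2$.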
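
The recurrence for the block sizes is easy to evaluate. The following sketch is ours, not the text's: the function name `block_sizes` is hypothetical, it assumes $p_1 \ge 3$ and $p_i \ge 1$ for the other levels, and it replaces the real-valued floors by exact integer arithmetic using the identity that the floor of $\sqrt{x/3}/a$ equals `isqrt(x // 3) // a` for positive integers x and a.

```python
from math import isqrt

def block_sizes(p):
    """Return [r_1, ..., r_k] for a resource vector p = (p_1, ..., p_k).

    Each r_i is a multiple of r_{i-1}, so the blockings nest without
    fragmenting memory.
    """
    sizes = []
    s = 0
    for i, p_i in enumerate(p, start=1):
        s += p_i                                # s_i = p_1 + ... + p_i
        root = isqrt((s - i + 1) // 3)          # floor(sqrt((s_i - i + 1) / 3))
        r = root if i == 1 else sizes[-1] * (root // sizes[-1])
        sizes.append(r)
    return sizes
```

For the illustrative resource vector (12, 50, 200, 1000) this yields block sizes 2, 4, 8, 16, each dividing the next and each satisfying $s_i \ge 3r_i^2 + i - 1$.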

By definition, $s_l$ pebbles are available at level l and below. As stated earlier, there is at
least one pebble at each level above the first. From the $s_l$ pebbles at level l and below we
create a reserve set containing one pebble at each level except the first. This reserve set is
used to perform I/O operations as needed.

Without loss of generality, assume that $r_k$ divides n. (If not, n must be at most doubled
for this to be true. Embed A, B, and C in such larger matrices.) A, B, and C are then
blocked into $r_k \times r_k$ submatrices (call them $a_{i,j}$, $b_{i,j}$, and $c_{i,j}$), and these in turn are blocked
into $r_{k-1} \times r_{k-1}$ submatrices, continuing until 1 x 1 submatrices are reached. The submatrix
$c_{i,j}$ is defined as

$$c_{i,j} = \sum_{q=1}^{n/r_k} a_{i,q} \times b_{q,j}$$

As in Theorem 11.5.2, $c_{i,j}$ is computed as a running sum, as suggested in Fig. 11.4,
where each vertex is associated with an $r_k \times r_k$ submatrix. It follows that $3r_k^2$ pebbles at
level k or less (not including the reserve pebbles) suffice to hold pebbles on submatrices $a_{i,q}$,
$b_{q,j}$ and the running sum. To compute a product $a_{i,q} \times b_{q,j}$, we represent $a_{i,q}$ and $b_{q,j}$ as
block matrices with blocks that are $r_{k-1} \times r_{k-1}$ matrices. Again, we form this product as
suggested in Fig. 11.4, using $3r_{k-1}^2$ pebbles at levels k - 1 or lower. This process is repeated
until we encounter a product of $r_1 \times r_1$ matrices, which is then pebbled according to the
procedure given in the proof of Theorem 11.5.2.

Let's now determine the number of I/O and computation steps at each level. Since all
non-input vertices of G are pebbled once, the number of computation steps is $O(n^3)$. I/O
operations are done only on input and output vertices. Once an output vertex has been
pebbled at the first level, reserve pebbles can be used to place a level-L pebble on it. Thus
one output is done on each of the $n^2$ output vertices at each level.

We now count the I/O operations on input vertices starting with level k. The n x n matrices
A, B, and C contain $r_k \times r_k$ matrices, where $r_k$ divides n. Each of the $(n/r_k)^2$ submatrices
$a_{i,q}$ and $b_{q,j}$ is used in $(n/r_k)$ inner products and at most $r_k^2$ I/O operations at level k are
performed on them. (If most of the $s_k$ pebbles at level k or less are at lower levels, fewer
level-k I/O operations will be performed.) Thus, at most $2(n/r_k)^2 (n/r_k) r_k^2 = 2n^3/r_k$
I/O operations are performed at level k. In turn, each of the $r_k \times r_k$ matrices contains
$(r_k/r_{k-1})^2$ $r_{k-1} \times r_{k-1}$ matrices; each of these is involved in $(r_k/r_{k-1})$ inner products
each of which requires at most $r_{k-1}^2$ I/O operations. Since there are at most $(n/r_{k-1})^2$
$r_{k-1} \times r_{k-1}$ submatrices in each of A, B, and C, at most $2n^3/r_{k-1}$ I/O operations are
performed at level k - 1. Continuing in this fashion, at most $2n^3/r_l$ I/O operations are
performed at level l for $2 \le l \le k$. Since $r_l \ge \sqrt{(s_l - l + 1)/12}$, we have the desired
conclusion.

Since the above pebbling strategy does not place pebbles at level 2 or above on any vertex 
except input and output vertices, it applies in the I/O-limited case. The lower bound follows 
from Lemma 11.3.1 and Theorem 11.5.2. ■ 

11.5.3 The Fast Fourier Transform 

The fast Fourier transform (FFT) algorithm is described in Section 6.7.3 (an FFT graph is
given in Fig. 11.1). A lower bound is obtained by the Hong-Kung method for the FFT by
deriving an upper bound on the S-span of the FFT graph. In this section all logarithms have
base 2.

LEMMA 11.5.2 The S-span of the FFT graph $F^{(d)}$ on $n = 2^d$ inputs satisfies
$p(S, G) \le 2S \log S$ when $S \le n$.

Proof p(S, G) is the maximum number of vertices of G that can be pebbled with S red
pebbles from an initial placement of these pebbles, maximized over all such initial placements.
G contains many two-input FFT (butterfly) graphs, as shown in Fig. 11.8. If $v_1$
and $v_2$ are the output vertices in such a two-input FFT and if one of them is pebbled, we
obtain an upper bound on the number of pebbled vertices if we assume that both of them
are pebbled. In this proof we let $\{p_i \mid 1 \le i \le S\}$ denote the S pebbles available to pebble
G. We assign an integer cost $num(p_i)$ (initialized to zero) to the ith pebble $p_i$ in order to
derive an upper bound to the total number of pebble placements made on G.

Figure 11.8 A two-input butterfly graph with pebbles $p_1$ and $p_2$ resident on inputs.

Consider a matching pair of output vertices $v_1$ and $v_2$ of a two-input butterfly graph
and their common predecessors $u_1$ and $u_2$, as suggested in Fig. 11.8. Suppose that on the
next step we can place a pebble on $v_1$. Then pebbles (call them $p_1$ and $p_2$) must reside on
$u_1$ and $u_2$. Advance $p_1$ and $p_2$ to both $v_1$ and $v_2$. (Although the rules stipulate that an
additional pebble is needed to advance the two pebbles, violating this restriction by allowing
their movement to $v_1$ and $v_2$ can only increase the number of possible moves, a useful effect
since we are deriving an upper bound on the number of pebble placements.)

After advancing $p_1$ and $p_2$, if $num(p_1) = num(p_2)$, augment both by 1; otherwise,
augment the smaller by 1. Since the predecessors of two vertices in an FFT graph are in
disjoint trees, there is no loss in assuming that all S pebbles remain on the graph in a
pebbling that maximizes the number of pebbled vertices. Because two pebble placements
are possible each time $num(p_i)$ increases by 1 for some i, $p(S, G) \le 2 \sum_{1 \le i \le S} num(p_i)$.

We now show that the number of vertices that contained pebbles initially and are connected
via paths to the vertex covered by $p_i$ is at least $2^{num(p_i)}$. That is, $2^{num(p_i)} \le S$
or $num(p_i) \le \log_2 S$, from which the upper bound on p(S, G) follows. Our proof is by
induction. For the base case of $num(p_i) = 1$, two pebbles must reside on the two immediate
predecessors of a vertex containing the pebble $p_i$. Assume that the hypothesis holds for
$num(p_i) \le e - 1$. We show that it holds for $num(p_i) = e$. Consider the first point in
time that $num(p_i) = e$. At this time $p_i$ and a second pebble $p_j$ reside on a matching pair
of vertices, $v_1$ and $v_2$. Before these pebbles are advanced to these two vertices from $u_1$ and
$u_2$, the immediate predecessors of $v_1$ and $v_2$, the smaller of $num(p_i)$ and $num(p_j)$ has a
value of e - 1. This must be $p_i$ because its value has increased. Thus, each of $u_1$ and $u_2$
has at least $2^{e-1}$ predecessors that contained pebbles initially. Because the predecessors of $u_1$
and $u_2$ are disjoint, each of $v_1$ and $v_2$ has at least $2^e = 2^{num(p_i)}$ predecessors that carried
pebbles initially. ■
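
Substituting Lemma 11.5.2 into Theorem 11.4.1 gives a concrete I/O lower bound for the FFT graph in the red-blue pebble game. The following is a sketch under two assumptions: that $2S \le n$, so the lemma applies with 2S pebbles, and that $F^{(d)}$ has $n \log n$ non-input vertices ($d = \log n$ levels of n vertices above the inputs). The constants are illustrative.

```latex
\begin{align*}
|V| - |In(G)| &= n\log n, \qquad
p(2S, G) \le 2(2S)\log(2S) = 4S\log(2S),\\
\left\lceil \frac{T_2^{(2)}(S, G, \mathcal{P})}{S} \right\rceil
  &\ge \frac{n\log n}{4S\log(2S)},\\
T_2^{(2)}(S, G, \mathcal{P})
  &\ge S\left(\left\lceil \frac{T_2^{(2)}(S, G, \mathcal{P})}{S} \right\rceil - 1\right)
   \ge \frac{n\log n}{4\log(2S)} - S .
\end{align*}
```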

This upper bound on the S-span is combined with Theorem 11.4.1 to derive a lower
bound on the I/O time at level l to pebble the FFT graph. We derive upper bounds that match
to within a multiplicative constant when the FFT graph is pebbled in the standard MHG. We
develop bounds for the red-blue pebble game and then generalize them to the MHG.

THEOREM 11.5.4 Let the FFT graph on $n = 2^d$ inputs, $F^{(d)}$, be pebbled in the red-blue
pebble game with S red pebbles. When $S \ge 3$ there is a pebbling of $F^{(d)}$ such that the following
bounds hold simultaneously, where $T_1^{(2)}(S, F^{(d)})$ and $T_2^{(2)}(S, F^{(d)})$ are the computation and
I/O time in a minimal pebbling of $F^{(d)}$:

$$T_1^{(2)}(S, F^{(d)}) = \Theta(n \log n)$$
$$T_2^{(2)}(S, F^{(d)}) = \Theta\left(\frac{n \log n}{\log S}\right)$$

Proof The lower bound on $T_1^{(2)}(S, F^{(d)})$ is obvious; every vertex in $F^{(d)}$ must be pebbled
a first time. The lower bound on $T_2^{(2)}(S, F^{(d)})$ follows from Corollary 11.4.1,
Theorem 11.3.1, Lemma 11.5.2, and the obvious lower bound on |V|. We now exhibit a pebbling
strategy giving upper bounds that match the lower bounds up to a multiplicative factor.

As shown in Corollary 6.7.1, $F^{(d)}$ can be decomposed into $\lceil d/e \rceil$ stages, $\lfloor d/e \rfloor$ stages
containing $2^{d-e}$ copies of $F^{(e)}$ and one stage containing $2^{d-k}$ copies of $F^{(k)}$,
$k = d - \lfloor d/e \rfloor e$. (See Fig. 11.9.) The output vertices of one stage are the input vertices to the next.
For example, $F^{(12)}$ can be decomposed into three stages with $2^{12-4} = 256$ copies of $F^{(4)}$
on each stage and one stage with $2^{12}$ copies of $F^{(0)}$, a single vertex. (See Fig. 11.10.) We use
this decomposition and the observation that $F^{(e)}$ can be pebbled level by level with $2^e + 1$
level-1 pebbles without repebbling any vertex to develop our pebbling strategy for $F^{(d)}$.

Given S red pebbles, our pebbling strategy is based on this decomposition with
$e = d_0 = \lfloor \log_2(S - 1) \rfloor$. Since $S \ge 3$, $d_0 \ge 1$. Of the S red pebbles, we actually use only
$S_0 = 2^{d_0} + 1$. Since $S_0 \le S$, the number of I/O operations with $S_0$ red pebbles is no
less than with S red pebbles. Let $d_1 = \lfloor d/d_0 \rfloor$. Then, $F^{(d)}$ is decomposed into $d_1$ stages
each containing $2^{d-d_0}$ copies of $F^{(d_0)}$ and one stage containing $2^{d-t}$ copies of $F^{(t)}$ where
$t = d - d_0 d_1$. Since $t < d_0$, each vertex in $F^{(t)}$ can be pebbled with $S_0$ pebbles without
re-pebbling vertices. The same applies to $F^{(d_0)}$.

Figure 11.9 Decomposition of the FFT graph $F^{(d)}$ into $\beta = 2^e$ bottom FFT graphs $F^{(d-e)}$
and $\tau = 2^{d-e}$ top $F^{(e)}$. Edges between bottom and top sub-FFT graphs identify common
vertices between the two.

Figure 11.10 The decomposition of an FFT graph $F^{(12)}$ into three stages each containing 256
copies of $F^{(4)}$. The gray areas identify rows of $F^{(12)}$ in which inputs to one copy of $F^{(4)}$ are
outputs of copies of $F^{(4)}$ at the preceding level.

The pebbling strategy for the red-blue pebble game is based on this decomposition.
Pebbles are advanced to outputs of each of the bottom FFT subgraphs $F^{(t)}$ using $2^t + 1 \le S_0$
red pebbles, after which the red pebbles are replaced with blue pebbles. The subgraphs $F^{(d_0)}$
in each of the succeeding stages are then pebbled in the same fashion; that is, their blue-pebbled
inputs are replaced with red pebbles and red pebbles are advanced to their outputs
after which they are replaced with blue pebbles.

This strategy pebbles each vertex once with red pebbles with the exception of vertices
common to two FFT subgraphs, which are pebbled twice. It follows that
$T_1^{(2)}(S, F^{(d)}) \le 2^{d+1}(d + 1) = 2n(\log_2 n + 1)$. This strategy also executes one I/O operation for each
of the $2^{d+1}$ inputs and outputs to $F^{(d)}$ and two I/O operations for each of the $2^d$ vertices
common to adjacent stages. Since there are $\lceil d/d_0 \rceil$ stages, there are $\lceil d/d_0 \rceil - 1$ such pairs
of stages. Thus, the number of I/O operations satisfies
$T_2^{(2)}(S, F^{(d)}) \le 2^{d+1} \lceil d/d_0 \rceil \le 2n(\log_2 n/((\log_2 S)/4) + 1) = O(n \log n/\log S)$. ■
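
The I/O count of this staged strategy can be computed directly from d and S. The sketch below is an illustration rather than the author's procedure: the function name `fft_io_count` is hypothetical, and it simply tallies the operations named in the proof (one per input and output vertex, two per vertex shared by adjacent stages).

```python
def fft_io_count(d, S):
    """Return (d0, stages, io_ops) for the staged pebbling of the FFT
    graph on 2**d inputs with S >= 3 red pebbles."""
    if S < 3:
        raise ValueError("the strategy needs at least 3 red pebbles")
    d0 = (S - 1).bit_length() - 1           # floor(log2(S - 1)), at least 1
    stages = -(-d // d0)                    # ceil(d / d0)
    n = 2 ** d
    boundary_io = 2 * n                     # one I/O per input and per output
    interstage_io = 2 * n * (stages - 1)    # two I/Os per vertex between stages
    return d0, stages, boundary_io + interstage_io
```

For d = 12 and S = 17 this gives $d_0 = 4$, three stages, and $2^{13} \cdot 3 = 24576$ I/O operations, which equals $2^{d+1} \lceil d/d_0 \rceil$.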

The bounds for the multi-level case generalize those for the red-blue pebble game. As with 
matrix multiplication, care must be taken to avoid memory fragmentation. 



THEOREM 11.5.5 Let the FFT graph on $n = 2^d$ inputs, $F^{(d)}$, be pebbled in the standard MHG
with resource vector p. Let $s_l = \sum_{j=1}^{l} p_j$ and let k be the largest integer such that $s_k \le n$. When
$p_1 \ge 3$, the following lower bounds hold for all pebblings of $F^{(d)}$ and there exists a pebbling $\mathcal{P}$ for
which the upper bounds are simultaneously satisfied:

$$T_l^{(L)}(p, F^{(d)}, \mathcal{P}) = \begin{cases} \Theta(n \log n) & l = 1 \\ \Theta\left(\dfrac{n \log n}{\log s_{l-1}}\right) & 2 \le l \le k \\ \Theta(n) & k + 1 \le l \le L \end{cases}$$

Proof Proofs of the first two lower bounds follow from Lemma 11.3.1 and Theorem 11.5.4.
The third follows from the fact that pebbles at every level must be placed on each input and
output vertex but no intermediate vertex. We now exhibit a pebbling strategy giving upper
bounds that match (up to a multiplicative factor) these lower bounds for all $1 \le l \le L$.
(See Fig. 11.9.)

We define a non-decreasing sequence $d = (d_0, d_1, d_2, \ldots, d_{L-1})$ of integers used below
to describe an efficient multi-level pebbling strategy for $F^{(d)}$. Let $d_0 = 1$ and $d_1 =
\lfloor \log(s_1 - 1) \rfloor \ge 1$, where $s_1 = p_1 \ge 3$. Define $m_r$ and $d_r$ for $2 \le r \le L - 1$ by

$$m_r = \lfloor \log \min(s_r - 1, n) \rfloor, \qquad d_r = d_{r-1} \lfloor m_r/d_{r-1} \rfloor$$

It follows that $s_r \ge 2^{d_r} + 1$ when $s_r \le n + 1$ since $a \lfloor b/a \rfloor \le b$. Because $\lfloor \log a \rfloor \ge
(\log a)/2$ when $a \ge 1$ and also $a \lfloor b/a \rfloor \ge b/2$ for integers $a$ and $b$ when $1 \le a \le b$ (see
Problem 11.1), it follows that $d_r \ge \log(\min(s_r - 1, n))/4$. The values $d_l$ are chosen to
avoid memory fragmentation.
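The sequence of stage depths can be computed directly from the resource vector. The following is a minimal sketch, assuming the recurrence $d_r = d_{r-1} \lfloor m_r/d_{r-1} \rfloor$ with $m_r = \lfloor \log_2 \min(s_r - 1, n) \rfloor$ and base-2 logarithms; the function name `stage_depths` is hypothetical and not part of the text.

```python
def floor_log2(x):
    """floor(log2(x)) for a positive integer x, computed without floating point."""
    return x.bit_length() - 1

def stage_depths(p, n):
    """Return [d_0, d_1, ..., d_r] for resource vector p = (p_1, p_2, ...).

    Assumed definitions: d_0 = 1, d_1 = floor(log2(s_1 - 1)),
    m_r = floor(log2(min(s_r - 1, n))), d_r = d_{r-1} * floor(m_r / d_{r-1}).
    """
    if p[0] < 3:
        raise ValueError("the strategy assumes p_1 >= 3")
    s = p[0]                          # s_1 = p_1
    d = [1, floor_log2(s - 1)]        # d_0 and d_1
    for p_r in p[1:]:
        s += p_r                      # s_r = p_1 + ... + p_r
        m_r = floor_log2(min(s - 1, n))
        d.append(d[-1] * (m_r // d[-1]))   # largest multiple of d_{r-1} not exceeding m_r
    return d

depths = stage_depths((5, 60, 1000), n=1024)
```

Because each $d_r$ is a multiple of $d_{r-1}$, an FFT subgraph of depth $d_r$ splits into a whole number of stages of depth $d_{r-1}$, which is one way to read the remark about avoiding fragmentation.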

Before describing our pebbling strategy, note that because we assume at least one pebble
is available at each level in the hierarchy, it is possible to perform an I/O operation at each
level. Also, pebbles at levels less than $l$ can be used as though they were at level $l$.

Our pebbling strategy is based on the decomposition of $F^{(d)}$ into FFT subgraphs $F^{(d_k)}$,
each of which is decomposed into FFT subgraphs $F^{(d_{k-1})}$, and so on, until reaching FFT
subgraphs $F^{(1)}$ that are two-input, two-output butterfly graphs. To pebble $F^{(d)}$ we apply
the strategy described in the proof of Theorem 11.5.4 as follows. We decompose $F^{(d_2)}$
into $d_2/d_1$ stages, each containing $2^{d_2 - d_1}$ copies of $F^{(d_1)}$, which we pebble with $s_1 = p_1$
first-level pebbles using this strategy. By the analysis in the proof of Theorem 11.5.4, $2^{d_2+1}$
level-2 I/O operations are performed on inputs and outputs to $F^{(d_2)}$ as well as another $2^{d_2+1}$
level-2 I/O operations on the vertices between two stages. Since there are $d_2/d_1$ stages, a
total of $(d_2/d_1) 2^{d_2+1}$ level-2 I/O operations are performed. We then decompose $F^{(d_3)}$ into
$d_3/d_2$ stages each containing $2^{d_3 - d_2}$ copies of $F^{(d_2)}$. We pebble $F^{(d_3)}$ with $s_2$ pebbles at level
1 or 2 by pebbling copies of $F^{(d_2)}$ in stages, using $(d_3/d_2) 2^{d_3+1}$ level-3 I/O operations and
using $(d_3/d_2) 2^{d_3 - d_2}$ times as many level-2 I/O operations as used by $F^{(d_2)}$. Let $n_2^{(3)}$ be the
number of level-2 I/O operations used to pebble $F^{(d_3)}$. Then $n_2^{(3)} = (d_3/d_1) 2^{d_3+1}$.

Continuing in this fashion, we pebble $F^{(d_r)}$, $1 \le r \le k$, with $s_{r-1}$ pebbles at levels $r - 1$ or
below by pebbling copies of $F^{(d_{r-1})}$ in stages, using $(d_r/d_{r-1}) 2^{d_r+1}$ level-$r$ I/O operations
and using $(d_r/d_{r-1}) 2^{d_r - d_{r-1}}$ times as many level-$j$ I/O operations for $1 \le j \le r - 1$. Let $n_j^{(r)}$
be the number of level-$j$ I/O operations used to pebble $F^{(d_r)}$. By induction it follows that
$n_j^{(r)} = (d_r/d_{j-1}) 2^{d_r+1}$.

For $r > k$, the number of pebbles available at level $r$ or less is at least $2^d + 1$, which is
enough to pebble $F^{(d)}$ by levels without performing I/O operations above level $k + 1$; this
means that I/O operations at these levels are performed only on inputs, giving the bound
$T_r^{(L)}(p, F^{(d)}, \mathcal{P}) = O(n)$, $n = 2^d$, for $k + 1 \le r \le L$. When $r \le k$, we pebble $F^{(d)}$ by
decomposing it into $\lceil d/d_k \rceil$ stages such that each stage, except possibly the first, contains
$2^{d - d_k}$ copies of the FFT subgraph $F^{(d_k)}$. The first stage has $2^{d - d^*}$ copies of $F^{(d^*)}$ of depth
$d^* = d - (\lceil d/d_k \rceil - 1) d_k$, which we treat as subgraphs of the subgraph $F^{(d_k)}$ and pebble to
completion with a number of operations at each level that is at most the number to pebble
$F^{(d_k)}$. Each instance of $F^{(d_k)}$ is pebbled with $s_{k-1}$ pebbles at level $k - 1$ or lower and
a pebble at level $k$ or higher is left on its output. Since $s_{k+1} \ge n + 1$, there are enough
pebbles to do this.

Thus $T_l^{(L)}(p, F^{(d)}, \mathcal{P})$ satisfies the following bound for $1 \le l \le L$:

$$T_l^{(L)}(p, F^{(d)}, \mathcal{P}) \le \lceil d/d_k \rceil \, 2^{d - d_k} \, T_l^{(L)}(p, F^{(d_k)}, \mathcal{P})$$

Combining this with the earlier result, we have the following upper bound on the number
of I/O operations for $1 \le l \le k$:

$$T_l^{(L)}(p, F^{(d)}, \mathcal{P}) \le \lceil d/d_k \rceil (d_k/d_{l-1}) 2^{d+1}$$

Since, as noted earlier, $d_r \ge \log(\min(s_r - 1, n))/4$, we obtain the desired upper bound on
$T_l^{(L)}(p, F^{(d)}, \mathcal{P})$ by combining this result with the bound on $n_l^{(k)}$ given above.



The above results are derived for standard MHG and the family of FFT graphs. We now 
strengthen these results in two ways when the I/O-limited MHG is used. First, the I/O limita- 
tion requires more time for a given amount of storage and, second, the lower bound we derive 
applies to every graph for the discrete Fourier transform, not just those for the FFT. 

It is important to note that the efficient pebbling strategy used in the standard MHG 
makes extensive use of level-L pebbles on intermediate vertices of the FFT graph. When this is 
not allowed, the lower bound on the I/O time is much larger. Since the lower bounds for the 
standard and I/O-limited MHG on matrix multiplication are about the same, this illustrates
that the DFT and matrix multiplication make dramatically different use of secondary memory.
(In the following theorem a linear straight-line program is a straight-line program in which 
the operations are additions and multiplications by constants.) 

THEOREM 11.5.6 Let $FFT(n)$ be any DAG associated with the DFT on $n$ inputs when realized
by a linear straight-line program. Let $FFT(n)$ be pebbled with strategy $\mathcal{P}$ in the I/O-limited
MHG with resource vector $p$ and let $s_l = \sum_{j=1}^{l} p_j$. If $S = s_{L-1} \le n$, then for each pebbling
strategy $\mathcal{P}$, the computation and I/O time at level $l$ must satisfy the following bounds:

$$T_l^{(L)}(p, FFT(n), \mathcal{P}) = \Omega(n^2/S) \quad \text{for } 1 \le l \le L$$

Also, when $n = 2^d$, there is a pebbling $\mathcal{P}$ of the FFT graph $F^{(d)}$ such that the following relations
hold simultaneously when $S \ge 2 \log n$:

$$T_l^{(L)}(p, F^{(d)}, \mathcal{P}) = \begin{cases} O(n^2/S + n \log S) & l = 1 \\ O(n^2/S + n \log S/\log s_{l-1}) & 2 \le l \le L \end{cases}$$

Proof The lower bound follows from Theorem 11.3.1 and Theorem 10.5.5. We show that
the upper bounds can be achieved on $F^{(d)}$ under the I/O limitation simultaneously for
$1 \le l \le L$.

The pebbling strategy meeting the lower bounds is based on that used in the proof of
Theorem 10.5.5 to pebble $F^{(d)}$ using $S \le 2^d + 1$ pebbles in the red pebble game. The
number of level-1 pebble placements used in that pebbling is given in the statement of
Theorem 10.5.5. A level-2 I/O operation occurs once on each of the $2^d$ outputs and $2^{d-e}$
times on each of the $2^d$ inputs of the bottom FFT subgraphs, for a total of $2^d (2^{d-e} + 1)$
times.

The pebbling for the $L$-level MHG is patterned after the aforementioned pebbling for
the red pebble game, which is based on the decomposition of Lemma 6.7.4. (See Fig. 11.9.)
Let $e$ be the largest integer such that $S \ge 2^e + d - e$. Pebble the binary subtrees on
$2^{d-e}$ inputs in the $2^e$ bottom subgraphs $F^{(d-e)}$ as follows: On an input vertex level-$L$
pebbles are replaced by pebbles at all levels down to and including the first level. Then
level-1 pebbles are advanced on the subtrees in the order that minimizes the number of level-1
pebbles in the red pebble game. It may be necessary to use pebbles at all levels to make these
advances; however, each vertex in a subtree (of which there are $2^{d-e+1} - 1$) experiences at
most two transitions at each level in the hierarchy. In addition, each vertex in a bottom
tree is pebbled once with a level-1 pebble in a computation step. Therefore, the number of
level-$l$ transitions on vertices in the subtrees is at most $2^{d+1} (2^{d-e+1} - 1)$ for $2 \le l \le L$,
since this pebbling of $2^e$ subtrees is repeated $2^{d-e}$ times.

Once the inputs to a given subgraph $F^{(e)}$ have been pebbled, the subgraph itself is
pebbled in the manner indicated in Theorem 11.5.5, using $O(e 2^e/\log s_{l-1})$ pebbles at
each level $l$ for $2 \le l \le L$. Since this is done for each of the $2^{d-e}$ subgraphs $F^{(e)}$, it
follows that on the top FFT subgraphs a total of $O(e 2^d/\log s_{l-1})$ level-$l$ transitions occur,
$2 \le l \le L$. In addition, each vertex in a graph $F^{(e)}$ is pebbled once with a level-1 pebble
in a computation step.

It follows that at most

$$T_l^{(L)}(p, F^{(d)}, \mathcal{P}) = O\left( 2^d (2^{d-e+1} - 1) + \frac{e 2^d}{\log s_{l-1}} \right)$$

level-$l$ I/O operations occur for $2 \le l \le L$, as well as

$$T_1^{(L)}(p, F^{(d)}, \mathcal{P}) = O(2^d (2^{d-e+1} - 1) + e 2^d)$$

computation steps. It is left to the reader to verify that $2^e \le 2^e + d - e \le S < 2^{e+1} + d - e - 1
\le 4 \cdot 2^e$ when $e + 1 \ge \log d$ (this is implied by $S \ge 2d$), from which the result follows. ■
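The choice of $e$ and the resulting step count can be evaluated numerically. The sketch below assumes the definition "largest $e$ with $S \ge 2^e + d - e$" and the level-1 expression $2^d(2^{d-e+1} - 1) + e 2^d$ from the proof; the function names are hypothetical.

```python
def largest_e(S, d):
    """Largest integer e in [0, d] with S >= 2**e + d - e.

    The quantity 2**e + d - e is non-decreasing in e, so the last
    candidate that satisfies the inequality is the maximum.
    """
    best = None
    for e in range(d + 1):
        if S >= (1 << e) + d - e:
            best = e
    if best is None:
        raise ValueError("S is too small: need S >= d + 1")
    return best

def computation_steps_expression(S, d):
    """Evaluate 2^d (2^(d-e+1) - 1) + e 2^d, the expression inside the O() at level 1."""
    e = largest_e(S, d)
    n = 1 << d
    return n * ((1 << (d - e + 1)) - 1) + e * n

e = largest_e(S=40, d=10)
steps = computation_steps_expression(S=40, d=10)
```

For $d = 10$ and $S = 40$ this gives $e = 5$, and $2^e \le S < 4 \cdot 2^e$ holds, so $2^{d-e}$ is within a constant factor of $n/S$ and the first term is of order $n^2/S$.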

11.5.4 Convolution 

The convolution function $f^{(n,m)}_{\text{conv}} : R^{n+m} \mapsto R^{n+m-1}$ over a commutative ring $R$ (see
Section 6.7.4) maps an $n$-tuple $a$ and an $m$-tuple $b$ onto an $(n + m - 1)$-tuple $c$ and is
denoted $c = a \otimes b$. An efficient straight-line program for the convolution is described in
Section 6.7.4 that uses the convolution theorem (Theorem 6.7.2) and the FFT algorithm.
The convolution theorem in terms of the $2n$-point DFT and its inverse is

$$a \otimes b = F_{2n}^{-1}(F_{2n}(a) \times F_{2n}(b))$$

Obviously, when $n = 2^d$ the $2n$-point DFT can be realized by the $2n$-point FFT. The DAG
associated with this algorithm, shown in Fig. 11.11 for $d = 4$, contains three copies of the
$2n$-point FFT graph.
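The convolution theorem can be checked on a small instance. The sketch below is only an illustration: it works over the complex numbers rather than a general commutative ring, uses a direct $O(n^2)$ DFT instead of the FFT, and pads both $n$-tuples with zeros to length $2n$. All function names are hypothetical.

```python
import cmath

def dft(x, inverse=False):
    """Direct discrete Fourier transform of a sequence (inverse if requested)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0
        for j in range(n):
            acc += x[j] * cmath.exp(sign * 2j * cmath.pi * j * k / n)
        out.append(acc / n if inverse else acc)
    return out

def convolve_via_dft(a, b):
    """Convolve two integer n-tuples using 2n-point transforms."""
    n = len(a)
    assert len(b) == n
    fa = dft(list(a) + [0] * n)           # F_2n(a), zero padded
    fb = dft(list(b) + [0] * n)           # F_2n(b), zero padded
    prod = [u * v for u, v in zip(fa, fb)]  # componentwise product
    c = dft(prod, inverse=True)           # inverse 2n-point transform
    return [round(z.real) for z in c[: 2 * n - 1]]

def convolve_direct(a, b):
    """Convolution straight from the definition, for comparison."""
    c = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] += ai * bj
    return c

a, b = [1, 2, 3, 4], [5, 6, 7, 8]
via_dft = convolve_via_dft(a, b)
direct = convolve_direct(a, b)
```

The three calls to `dft` correspond to the three copies of the transform graph in the DAG: two on the inputs and one inverse transform on the product.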

We derive bounds on the computation and I/O time in the standard and I/O-limited 
memory-hierarchy game needed for the convolution function using this straight-line program. 
For the standard MHG, we invoke the lower bounds and an efficient algorithm for the FFT. 
For the I/O-limited MHG, we derive new lower bounds based on those for two back-to-back 
FFT graphs as well as upper bounds based on the I/O-limited pebbling algorithm given in 
Theorem 11.5.4 for FFT graphs.

THEOREM 11.5.7 Let $G^{(n)}_{\text{convolve}}$ be the graph of a straight-line program for the convolution of
two $n$-tuples using the convolution theorem, $n = 2^d$. Let $G^{(n)}_{\text{convolve}}$ be pebbled in the standard
MHG with the resource vector $p$. Let $s_l = \sum_{j=1}^{l} p_j$ and let $k$ be the largest integer such that
$s_k \le n$. When $p_1 \ge 3$ there is a pebbling of $G^{(n)}_{\text{convolve}}$ for which the following bounds hold
simultaneously:

$$T_l^{(L)}(p, G^{(n)}_{\text{convolve}}) = \begin{cases} \Theta(n \log n) & l = 1 \\ \Theta(n \log n/\log s_{l-1}) & 2 \le l \le k + 1 \\ \Theta(n) & k + 2 \le l \le L \end{cases}$$


Proof The lower bound follows from Lemma 11.3.2 and Theorem 11.5.5. From the former,
it is sufficient to derive lower bounds for a subgraph of a graph. Since $F^{(d)}$ is contained
in $G^{(n)}_{\text{convolve}}$, the lower bound follows.

Figure 11.11 A DAG for the graph of the convolution theorem on n = 8 inputs.

The upper bound follows from Theorem 11.5.5. We advance level-$L$ pebbles to the
outputs of each of the two bottom FFT graphs in Fig. 11.11 and then pebble the top
FFT graph. The number of I/O and computation steps used is triple that used to pebble
one such FFT graph. In addition, we perform $O(n)$ I/O and computation steps to combine
inputs to the top FFT graph. ■

The bounds for the I/O-limited version of the MHG for the convolution problem are 
considerably larger than those for the standard MHG. They have a much stronger dependence 
on $S$ and $n$ than do those for the FFT graph.

THEOREM 11.5.8 Let $G^{(n)}_{\text{convolve}}$ be the graph of any DAG for the convolution of two $n$-tuples
using the convolution theorem, $n = 2^d$. Let $G^{(n)}_{\text{convolve}}$ be pebbled in the I/O-limited MHG
with the resource vector $p$ and let $s_l = \sum_{j=1}^{l} p_j$. If $S = s_{L-1} \le n$, then the time to pebble
$G^{(n)}_{\text{convolve}}$ at the $l$th level, $T_l^{(L)}(p, G^{(n)}_{\text{convolve}})$, satisfies the following lower bounds simultaneously
for $1 \le l \le L$:

$$T_l^{(L)}(p, G^{(n)}_{\text{convolve}}) = \Omega(n^3/S^2)$$

when $S \le n/\log n$.

Proof A lower bound is derived for this problem by considering a generalization of the
graph shown in Fig. 11.11 in which the three copies of the FFT graph are replaced by
an arbitrary DAG for the DFT. This could in principle yield a smaller lower bound on the
time to pebble the graph. We then invoke Lemma 11.3.2 to show that a lower bound can
be derived from a reduction of this new graph, namely, that consisting of two back-to-back
DFT graphs obtained by deleting one of the bottom FFT graphs. We then derive a lower
bound on the time to pebble this graph with the red pebble game and use it together with
Theorem 11.3.1 to derive the lower bounds mentioned above.

Consider pebbling two back-to-back DAGs for the DFT on $n$ inputs, $n$ even, in the red
pebble game. From Lemma 10.5.4, the $n$-point DFT function is $(2, n, n, n/2)$-independent.
From the definition of the independence property (see Definition 10.4.2), we know
that during a time interval in which $2(S + 1)$ of the $n$ outputs of the second DFT DAG
on $n$ inputs are pebbled, at least $n/2 - 2(S + 1)$ of its inputs are pebbled. In a back-to-back
DFT graph these inputs are also outputs of the first DFT graph. It follows that for
each group of $2(S + 1)$ of these $n/2 - 2(S + 1)$ outputs of the first DFT DAG, at least
$n/2 - 2(S + 1)$ of its inputs are pebbled. Thus, to pebble a group of $2(S + 1)$ outputs
of the second DFT DAG (of which there are at least $\lfloor n/(2(S + 1)) \rfloor$ groups), at least
$\lfloor (n/2 - 2(S + 1))/(2(S + 1)) \rfloor (n/2 - 2(S + 1))$ inputs of the first DFT must be pebbled.
Thus, $T_l^{(L)}(p, G^{(n)}_{\text{convolve}}) \ge n^3/(64(S + 1)^2)$, since it holds both when $S \le n/(4\sqrt{2})$ and
when $S > n/(4\sqrt{2})$.

Let's now consider a pebbling strategy that achieves this lower bound up to a multiplicative
constant. The pebbling strategy of Theorem 11.5.5 can be used for this problem. It
represents the FFT graph $F^{(d)}$ as a set of FFT graphs $F^{(e)}$ on top and a set of FFT graphs
$F^{(d-e)}$ on the bottom. Outputs of one copy of $F^{(e)}$ are pebbled from left to right. This
requires pebbling inputs of $F^{(d)}$ from left to right once. To pebble all outputs of $F^{(d)}$, $2^{d-e}$
copies of $F^{(e)}$ are pebbled and the $2^d$ inputs to $F^{(d)}$ are pebbled $2^{d-e}$ times.

Figure 11.12 An I/O-limited pebbling of a DAG for the convolution theorem showing the
placement of eight pebbles.

Consider the graph $G^{(n/2)}_{\text{convolve}}$ consisting of three copies of $F^{(d)}$, two on the bottom and
one on top, as shown in Fig. 11.12. Using the above strategy, we pebble the outputs of the
two bottom copies of $F^{(d)}$ from left to right in parallel a total of $2^{d-e}$ times. The outputs
of these two graphs are pebbled in synchrony with the pebbling of the top copy of $F^{(d)}$. It
follows that the number of I/O and computation steps used on the bottom copies of $F^{(d)}$
in $G^{(n/2)}_{\text{convolve}}$ is $2(2^{d-e})$ times the number on one copy, with twice as many pebbles at each
level, plus the number of such steps on the top copy of $F^{(d)}$. It follows that $G^{(n/2)}_{\text{convolve}}$ can
be pebbled with three times the number of pebbles at each level as can $F^{(d)}$, with $O(2^{d-e})$
times as many steps at each level. The conclusion of the theorem follows from manipulation
of terms. ■

The bounds given above also apply to some permutation and merging networks. Since, 
as shown in Section 6.8, the graph of Batcher's bitonic merging network is an FFT graph, 
the bounds on I/O and computation time given earlier for the FFT also apply to it. Also, as 
shown in Section 7.8.2, since a permutation network can be constructed of two FFT graphs 
connected back-to-back, the lower bounds for convolution apply to this graph. (See the proofs 
of Theorems 11.5.7 and 11.5.8.) The same order-of-magnitude upper bounds follow from 
constructions that differ only in details from those given in these theorems. 



11.6 Block I/O in the MHG

Many memory units move data in large blocks, not in individual words, as generally assumed
in the above sections. (Note, however, that one pebble can carry a block of data.) Data is
moved in blocks because the time to fetch one word and a block of words is typically about the
same. Figure 11.13 suggests why this is so. A disk spinning at 3,600 rpm that has 40 sectors
per track and 512 bits per sector (its block size) requires about 10 msec to find data in the track
under the head. However, the time to read one sector of 64 bytes (512 bits) is just 0.42 msec.

Figure 11.13 A disk unit with three platters and two heads per disk. Each track is divided into
four sectors and heads move in and out on a common arm. The memory of the disk controller
holds the contents of one track on one disk.

To model this phenomenon, we assume that the time to access $k$ disk sectors with consecutive
addresses is $\alpha + k\beta$, where $\alpha$ is a large constant and $\beta$ is a small one. (This topic is
also discussed in Section 7.3.) Given the ratio of $\alpha$ to $\beta$, it makes sense to move data to and
from a disk in blocks of size about equal to the number of bytes on a track. Some operating
systems move data in track-sized blocks, whereas others move them in smaller units, relying 
upon the fact that a disk controller typically keeps the contents of its current track in a fast 
random-access memory so that successive sector accesses can be done quickly. 
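The disk figures quoted above can be reproduced with a few lines of arithmetic. This is a rough sketch that takes the 10 msec positioning time as the startup cost $\alpha$ and the per-sector read time as $\beta$; the variable and function names are illustrative only.

```python
def access_time_ms(k, alpha_ms, beta_ms):
    """Time to access k consecutive sectors under the alpha + k*beta model."""
    return alpha_ms + k * beta_ms

rpm = 3600
sectors_per_track = 40

rotation_ms = 60_000 / rpm                  # one full revolution, about 16.67 msec
beta_ms = rotation_ms / sectors_per_track   # time for one sector to pass the head
alpha_ms = 10.0                             # startup cost quoted in the text

one_sector = access_time_ms(1, alpha_ms, beta_ms)
whole_track = access_time_ms(sectors_per_track, alpha_ms, beta_ms)
ratio = whole_track / one_sector
```

Reading all 40 sectors of a track costs less than three times as much as reading a single sector, which is the motivation for moving track-sized blocks.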

The gross characteristics of disks described by the above assumption hold for other storage 
devices as well, although the relative values of the constants differ. For example, in the case of a 
tape unit, advancing the tape head to the first word in a consecutive sequence of words usually 
takes a long time, but successive words can be read relatively quickly. 

The situation with interleaved random-access memory is similar, although the physical
arrangement of memory is radically different. As depicted in Fig. 11.14, an interleaved
random-access memory is a collection of $2^r$ memory modules, $r \ge 1$, each containing $2^k$
$b$-bit words. Such a memory can simulate a single $2^{r+k}$-word $b$-bit random-access memory.
Words with addresses $0, 2^r, 2 \cdot 2^r, 3 \cdot 2^r, \ldots, (2^k - 1) 2^r$ are stored in the first module, words with
addresses $1, 2^r + 1, 2 \cdot 2^r + 1, 3 \cdot 2^r + 1, \ldots, (2^k - 1) 2^r + 1$ in the second module, and words with
addresses $2^r - 1, 2 \cdot 2^r - 1, 3 \cdot 2^r - 1, 4 \cdot 2^r - 1, \ldots, 2^{r+k} - 1$ in the last module.

To access a word in this memory, the high order k bits are provided to each module. If 
a set of words is to be read, the words with these common high-order bits are copied to the 
registers. If a set of words is to be written, new values are copied from the registers to them. 
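The address layout described above amounts to using the low-order $r$ bits of an address to select the module and the high-order $k$ bits to select the word within every module. A minimal sketch, with hypothetical function names:

```python
def module_and_offset(address, r):
    """Split an address into (module index, word index within the module)."""
    module = address & ((1 << r) - 1)   # low-order r bits select the module
    offset = address >> r               # high-order k bits are sent to every module
    return module, offset

def addresses_in_module(module, r, k):
    """All addresses stored in one module of a 2^r-module, 2^k-words-per-module memory."""
    return [module + i * (1 << r) for i in range(1 << k)]

first = addresses_in_module(0, r=3, k=2)
last = addresses_in_module(7, r=3, k=2)
```

With $r = 3$ and $k = 2$ the first module holds addresses 0, 8, 16, 24 and the last holds 7, 15, 23, 31, so any $2^r$ consecutive addresses aligned to a multiple of $2^r$ share their high-order bits and can be fetched from all modules at once.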

When an interleaved memory is used to simulate a much faster random-access memory,
a CPU writes to or reads from the $2^r$ registers serially, whereas data is transferred in parallel
between the registers and the modules. The use of two sets of registers (double buffering)
allows the register sets to be alternated so that data can be moved continuously between the
CPU and the modules. This allows the interleaved memory to be about $2^r$ times slower than
the CPU and yet, with a small set of fast registers, appear to be as fast as the CPU. This works
only if the program accessing memory does not branch to a new set of words. If it does, the
startup time to access a new word is about $2^r$ times the CPU cycle time. Thus, an interleaved
random-access memory also requires time of the form $\alpha + k\beta$ to access $k$ words. For example,
for a moderately fast random-access chip technology $\alpha$ might be 80 nanoseconds whereas $\beta$
might be 10 nanoseconds, a ratio of 8 to 1.

Figure 11.14 Eight interleaved memory modules with double buffering. Addresses are supplied
in parallel while data is pipelined into and out of the memory.

This discussion justifies assuming that the time to move $k$ words with consecutive addresses
to and from the $l$th unit in the memory hierarchy is $\alpha_l + k\beta_l$ for positive constants $\alpha_l$ and
$\beta_l$, where $\alpha_l$ is typically much larger than $\beta_l$. If $k = b_l = \lceil \alpha_l/\beta_l \rceil$, then $\alpha_l + k\beta_l \approx 2\alpha_l$
and the time to retrieve one item and $b_l$ items is about the same. Thus, efficiency dictates that
items should be fetched in blocks, especially if all or most of the items in a block can be used if
one of them is used. This justifies the block-I/O model described below. Here we let $t_l$ be the
time to move a block at level $l$. We add the requirement that data stored together be retrieved
together to reflect physical constraints existing in practice.
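The choice $b_l = \lceil \alpha_l/\beta_l \rceil$ can be illustrated with the 80 ns and 10 ns figures quoted earlier for a random-access chip. The helper names below are hypothetical, and integer nanoseconds are assumed so that the ceiling can be computed exactly.

```python
def block_size(alpha, beta):
    """b = ceil(alpha / beta) for positive integers alpha and beta in the same unit."""
    return -(-alpha // beta)

def transfer_time(k, alpha, beta):
    """Time alpha + k*beta to move k words with consecutive addresses."""
    return alpha + k * beta

alpha_ns, beta_ns = 80, 10
b = block_size(alpha_ns, beta_ns)
one_word = transfer_time(1, alpha_ns, beta_ns)
one_block = transfer_time(b, alpha_ns, beta_ns)
```

Here a block of 8 words takes 160 ns, exactly $2\alpha$, against 90 ns for a single word, so fetching the whole block costs less than twice the time of fetching one item.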

DEFINITION 11.6.1 (Block-I/O Model) At the $l$th level in a memory hierarchy, I/O operations
are performed on blocks. The block size and the time in seconds to access a block at the $l$th level are
$b_l$ and $t_l$, respectively. For each $l$, $b_l/b_{l-1}$ is an integer. In addition, any data written as part of a
block at level $l$ must be read into level $l - 1$ by reading the entire block in which it was stored.

The lower bounds on the number of I/O steps given in Section 11.5 can be generalized to
the block-I/O case by dividing the number of I/O operations by the size $b_l$ of blocks moving
between levels $l - 1$ and $l$. This lower bound can be achieved for matrix-vector and matrix-matrix
multiplication because data is always written to and read from the higher-level memory
in the same way for these problems. (See Problems 11.13 and 11.14.)

For the FFT graph in the standard MHG, instead of pebbling FFT subgraphs on $2^r$
inputs, we pebble $b_l$ FFT subgraphs on $2^r/b_l$ inputs (assuming that $b_l$ is a power of 2).
Doing so allows all the data moving back and forth in blocks between memories to be used
and accommodates the transposition mentioned at the beginning of Section 11.5.3. This
provides an upper bound of $O(n \log n/(b_{l-1} \log(s_{l-1}/b_{l-1})))$ on the I/O time at level $l$.
Clearly, when $b_{l-1}$ is much smaller than $s_{l-1}$, say $b_{l-1} = O(\sqrt{s_{l-1}})$, the upper and lower
bounds match to within a multiplicative factor. (This follows because we divide $n$ by $b_{l-1}$ and
$\log b_{l-1} = O(\log s_{l-1})$.) These observations apply to the FFT-based problems as well.

11.7 Simulating a Fast Memory in the MHG 

In this section we revisit the discussion of Section 11.1.2, taking into account that a memory 
hierarchy may have many levels and that data is moved in blocks. 

We ask the question, "How do we assess the effectiveness of a memory hierarchy on a 
particular problem?" For several problems we have upper and lower bounds on their number of 
computation and I/O steps in memory hierarchies parameterized by block sizes and numbers of 
storage locations. If we add to this mix the time to move a block between levels, we can derive 
bounds on the time for all computation and I/O steps. We then ask under what conditions 
this time is the best possible. Since data must typically be stored and retrieved from archival 
memory, we cannot expect the performance to exceed that of a two-level hierarchy (modeled 
by the red-blue pebble game) in which all the available storage locations, except for those in 
the archival memory, are in first-level storage. For this reason we use the two-level memory 
as our reference model. We now define these terms and state a condition for optimality of a 
pebbling strategy. 

For $1 \le l \le L - 1$ we let $t_l$ be the time to move one block of $b_l$ words between levels $l - 1$
and $l$ of a memory hierarchy, measured as a multiple of the time to perform one computation
step. Thus, the time for one computation step is $t_1 = 1$.

Let $\mathcal{P}$ be a pebbling strategy for a graph $G$ in the $L$-level MHG that uses the resource
vector $p = (p_1, p_2, \ldots, p_{L-1})$ ($p_l$ pebbles are used at the $l$th level) and moves data in blocks
of size specified by $b = (b_2, b_3, \ldots, b_L)$ ($b_l$ words are moved between levels $l - 1$ and $l$). Let
$T_l^{(L)}(p, b, G)$ denote the number of level-$l$ I/O operations with $\mathcal{P}$ on $G$. We define the time
for the pebbling strategy $\mathcal{P}$, $T(\mathcal{P}, G)$, on the graph $G$ as

$$T(\mathcal{P}, G) = \sum_{l=1}^{L} t_l \cdot T_l^{(L)}(p, b, G)$$

Thus, T(V, G) measures the absolute time expended to pebble a graph relative to the time 
to perform one computation step under the assumption that I/O operations cannot be over- 
lapped. 

From the above discussion, a pebbling is efficient if $T(\mathcal{P}, G)$ is at most some small multiple
of $T_1^{(2)}(s_{L-1}, G)$, the normalized time to pebble $G$ in the red-blue pebble game when all the
pebbles at level $L - 1$ or less in the MHG (there are $s_{L-1}$ such pebbles) are used as if they
were red pebbles.

A two-level computation exhibits locality of reference if it is likely in the near future
to refer to words currently in its primary memory. Such computations perform fewer I/O
operations than those that don't meet this condition. This idea extends to multiple levels: a
multi-level memory hierarchy exhibits locality of reference if it uses its higher-level memory
units much less often than its lower-level units. Formally, we say that a pebbling strategy $\mathcal{P}$ is
c-local if $T(\mathcal{P}, G)$ satisfies the following inequality:

$$\sum_{l=1}^{L} t_l \cdot T_l^{(L)}(p, b, G, \mathcal{P}) \le c \, T_1^{(2)}(s_{L-1}, G)$$
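The weighted sum and the c-local test are straightforward to evaluate once per-level operation counts are known. In the sketch below the per-level times, operation counts, and reference time are made-up illustrative numbers rather than values from the text, and the function names are hypothetical.

```python
def hierarchy_time(t, ops):
    """Sum of t_l * T_l over all levels; t[0] is the computation-step time t_1 = 1."""
    return sum(t_l * T_l for t_l, T_l in zip(t, ops))

def is_c_local(t, ops, c, reference_time):
    """Check whether the total time is at most c times the two-level reference time."""
    return hierarchy_time(t, ops) <= c * reference_time

# Illustrative three-level hierarchy: level 1 is computation, levels 2 and 3 are I/O.
t = [1, 10, 1000]              # time per operation at each level
ops = [10**6, 10**4, 100]      # number of operations at each level
total = hierarchy_time(t, ops)
```

In this example the higher levels are slow but rarely used, so the total of 1,200,000 time units stays within a factor of 2 of a reference time of 1,000,000, and the strategy would count as 2-local.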

The definition of a c-local pebbling strategy is illustrated by the results for matrix multiplication
in the standard MHG when block I/O is not used. Let $k$ be the largest integer such that
$s_k \le 3n^2$. From Theorem 11.5.3 for matrix-matrix multiplication, we see that there exists an
optimal pebbling if

(11.1)

for some $c^* > 0$ since $T_1^{(2)}(S, G) = \Theta(n^3)$.

We noted in Section 11.1.2 that the imbalance between the computation and I/O times 
for matrix multiplication is becoming ever more serious with the advance of technology. We 
re-examine this issue in light of the above condition. Consider the case in which k + 1 = L; 
that is, the highest-level memory is used to store the arguments and results of a computation. 
In this case the second term on the left-hand side of (11.1) is a relative measure of the time 
to bring data into lower-level memories from the highest-level memory. It is negligible when 
$n b_L$ is large. For example, if $t_L = 2{,}000{,}000$ and $b_L = 10{,}000$, say, then $n$ must be at least
200, a modest-sized matrix. The first term on the left-hand side reflects the number of times
data moves between the levels of the hierarchy holding the data. It is small when $b_l \sqrt{s_{l-1}}$
is large by comparison with $t_l$ for $2 \le l \le k$, a condition that is not hard to meet. For
example, if $s_{l-1} = 32 \times 10^6$ (about 4 Mbytes) and $b_l = 1{,}000$, then $t_l$ must be less than
about 45, a condition that certainly applies to low-level memories such as today's random-access
memories. Problems 11.15 and 11.16 provide opportunities to explore this issue with
the FFT and convolution. 

11.8 RAM-Based I/O Models

The MHG assumes that computations are done by pebbling the vertices of a directed acyclic 
graph. That is, it assumes that computations are straight-line. While the best known algo- 
rithms for the problems studied earlier in this chapter are straight-line, some problems are not 
efficiently done in a straight-line fashion. For example, binary search in a tree that holds a set 
of keys in sorted order (see Section 11.9.1) is much better suited to data-dependent compu- 
tation of the kind allowed by an unrestricted RAM. Similarly, the merging of two sorted lists 
can be done more efficiently on a RAM than with a straight-line program. For this reason 
we consider RAM-based I/O models, specifically the block-transfer model and the hierarchical 
memory model. 

11.8.1 The Block-Transfer Model 

The block-transfer model is a two-level I/O model that generalizes the red-blue pebble game
to RAM-based computations by allowing programs that are not straight-line.

DEFINITION 11.8.1 The block-transfer model (BTM) is a serial computer in which a CPU is
attached to an $M$-word primary memory and to a secondary memory of unlimited size that stores
words in blocks of size $B$. Words are moved in blocks between the memories and words that leave
primary memory in one block must return in that block. An I/O operation is the movement of a
block to or from secondary memory. The I/O time with the BTM is the number of I/O operations.

The secondary memory in the BTM can be a main memory if the primary memory is a
cache, or can be a disk if the primary memory is a random-access memory. In fact, it can model
I/O operations between any two devices. Since a block can be viewed as the contents of one
track of a disk, the time to retrieve any word on the track is comparable to the time to retrieve
the entire track. (See Section 11.6.) Since data is moved in blocks in the BTM, it makes sense
to define simple I/O operations.

DEFINITION 11.8.2 An I/O operation in the BTM is simple if after a block or word is copied
from one memory to the other, the copy in the first memory is deleted.

Simple I/O operations for the pebble game are defined in Problem 11.10. In this problem 
the reader is asked to show that replacing all I/O operations with simple I/O operations has 
the effect of at most doubling the number of I/O operations. The proof of this fact applies 
equally well to the BTM. 

We illustrate the use of the block-transfer model by examining the sorting problem. We 
derive a lower bound on the I/O time for all sorting algorithms and exhibit a sorting algorithm 
that meets the lower bound, up to a constant multiplicative factor. To derive the lower bound, 
we limit the range of sorting algorithms to those based on the comparison of keys, as stated 
below. (Sorting algorithms that are not comparison-based, such as the various forms of radix 
sort, assume that keys consist of individual digits and that digits are used to classify keys.) 

ASSUMPTION 11.8.1 All words to be sorted are located initially in the secondary memory. The
compare-exchange operation is the only operation available to implement sorting algorithms on 
the BTM. In addition, an arbitrary permutation of the contents of the primary memory of the BTM 
can be done during the time required for one I/O operation. 

The assumption that the CPU can perform an arbitrary permutation on the contents of the 
primary memory during one I/O operation acknowledges that I/O operations take a very long 
time relative to CPU instructions. 

Algorithms consistent with these assumptions are described by the multiway decision trees 
discussed below. They are a generalization of the binary decision tree, a binary tree in which 
each vertex has associated with it a comparison between two variables. For example, if keys $x_1$
and $x_2$ are compared at the root vertex, the comparison has two outcomes, namely $x_1 < x_2$ or
$x_1 > x_2$, which are associated with the subtrees to the left and right of the root, respectively.
Similar comparisons and outcomes are possible at each vertex of these two subtrees. A sequence 
of comparisons terminates on a leaf node. 

Since a binary decision tree captures each of the data-dependent comparisons between keys
in a comparison-based sorting algorithm, each leaf is associated with the permutation of the
original sequence of variables that puts the sequence into sorted order. Thus, a binary decision
tree for sorting must have at least $n!$ distinct leaves, one for every permutation of $n$ items. The
length of a path through a binary decision tree is the number of comparisons performed on the
particular input, and the length of the longest path is a measure of the worst-case number of
comparisons. A binary tree with $N$ leaves has a longest path of length at least $\log_2 N$ because
if it were smaller, the tree would have fewer than $2^{\log_2 N} = N$ leaves. Since the length of the longest
path is an integer, it must be at least $\lceil \log_2 N \rceil$. We summarize this result as a lemma that uses
the lower bound on $n!$ given in Problem 2.23.

LEMMA 11.8.1 The length of the longest path in a binary decision tree that sorts $n$ inputs is at
least $\lceil \log_2 n! \rceil = \Theta(n \log n)$.
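The bound in the lemma is easy to evaluate exactly for small $n$. A minimal sketch follows, using integer arithmetic to avoid rounding error in the logarithm; the function name is hypothetical.

```python
from math import factorial

def min_worst_case_comparisons(n):
    """ceil(log2(n!)): the least possible depth of a binary decision tree with n! leaves."""
    leaves = factorial(n)
    # For an integer N >= 1, ceil(log2(N)) equals the bit length of N - 1.
    return (leaves - 1).bit_length()

bounds = [min_worst_case_comparisons(n) for n in (2, 3, 4, 5, 10)]
```

For example, sorting 5 keys needs at least 7 comparisons in the worst case, since $5! = 120$ lies between $2^6$ and $2^7$.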

The multiway decision tree in Fig. 11.15 extends the above concept by permitting multiple
comparisons at each vertex. If $k$ comparisons of variable pairs are associated with each
vertex, $2^k$ outcomes are possible.

THEOREM 11.8.1 Let $B$ divide $M$ and $M$ divide $n$. Under Assumption 11.8.1 on the BTM,
in the worst case the number of block I/O steps to sort a set of $n$ records using $M$ words of primary
memory and block size $B$, $T_{\text{BTMsort}}(n)$, satisfies the following bounds for $B \le M/2$ and $M$ large:

$$T_{\text{BTMsort}}(n) = \Theta\left( \max\left( \frac{n}{B}, \frac{(n/B) \log(n/B)}{\log(M/B)} \right) \right)$$

Proof Let's now apply the multiway decision tree to the BTM. Since each path in such a tree
corresponds to a sequence of comparisons by the CPU, the tree must have at least $n!$ leaves.
To complete the lower-bound derivation we need to determine the number of descendants
of vertices in the multiway tree.

Initially the $n$ unsorted words are stored in $n/B$ blocks in the secondary memory. The
first time one of these blocks is moved to the primary memory, up to $B!$ permutations
can be performed on the words in it. No more permutations are possible between these
words no matter how many times they are simultaneously in primary memory, even if they
return to the memory as members of different blocks. When a block of $B$ words arrives in
the $M$-word memory, the number of possible permutations between them (given that the
order among the $M - B$ words originally in the memory has previously been taken into
account, as has the order among the $B$ words in a block) is at most $p = \binom{M}{B}$, the binomial
coefficient. (To see this, observe that places for the $B$ new (and indistinguishable) words in
the primary memory can be any $B$ of the $M$ indistinguishable places.) It follows that the
multi-comparison decision tree for every comparison-based sorting algorithm on the
BTM has at most $n/B$ vertices with at most $p B!$ possible outcomes (vertices corresponding
to the first arrival of one of the blocks in primary memory) and that each of the other vertices
has at most $p$ outcomes.

Figure 11.15 A multiway decision tree in which multiple comparisons of keys are made at each
vertex.

It follows that if a sorting algorithm executes $T_{\mathrm{BTMsort}}(n)$ block I/O steps, the function
$T_{\mathrm{BTMsort}}(n)$ must satisfy the following inequality:

$$(B!)^{n/B} \binom{M}{B}^{T_{\mathrm{BTMsort}}(n)} \ge n!$$

Using the approximation to $n!$ given in Lemma 11.8.1, the upper bound of $(M/B)^B e^B$ on
$\binom{M}{B}$ derived in Lemma 10.12.1, and the fact that $T_{\mathrm{BTMsort}}(n) \ge n/B$, we have the desired conclusion.
An upper bound is obtained by extending the standard merging algorithm to blocks of
keys. The merging algorithm is divided into phases, an initialization phase and merging
phases, each of which takes $2n/B$ I/O operations. In the initialization phase, a set of
$n/M$ sorted sublists of $M$ keys or $M/B$ blocks is formed by bringing groups of $M$ keys into
primary memory, sorting, and then writing them out to secondary memory. In a merging
phase, $M/B$ sorted sublists of $L$ blocks ($L = M/B$ in the first merging phase) are merged
into one sorted sublist of $ML/B$ blocks, as suggested in Fig. 11.16. The first block of keys
(those with the smallest values) in each sublist is brought into memory and the $B$ smallest
keys in this set are written out to the new sorted sublist that is being constructed. If any
block from an input sublist is depleted, the next block from that list is brought in. There
is always sufficient space in primary memory to do this. Thus, after $k$ phases the sorted
sublists contain $(M/B)^k$ blocks. When $(M/B)^k \ge n/B$, the merging is done. Thus,
$(2n/B) \lceil \log_2(n/B)/\log_2(M/B) \rceil$ I/O operations are performed by this algorithm. ■
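
The two bounds in this proof can be compared numerically. The Python sketch below is only an illustration under our own naming (`merge_sort_io` and `decision_tree_lower_bound` are hypothetical helpers, not part of the text): the first function counts the I/O operations of the block merging algorithm as phases of $2n/B$ operations each, and the second finds the smallest $T$ that satisfies the decision-tree inequality $(B!)^{n/B} \binom{M}{B}^{T} \ge n!$ together with $T \ge n/B$.

```python
import math

def merge_sort_io(n: int, M: int, B: int) -> int:
    """I/O operations of block merging: (2n/B) * ceil(log(n/B) / log(M/B))."""
    assert M % B == 0 and n % M == 0 and B <= M // 2
    fan_in, blocks = M // B, n // B
    phases, run = 0, 1              # run == (M/B) ** phases, kept as an integer
    while run < blocks:             # stop once (M/B)^k >= n/B
        run *= fan_in
        phases += 1
    return 2 * blocks * phases

def decision_tree_lower_bound(n: int, M: int, B: int) -> int:
    """Smallest T with (B!)^(n/B) * C(M, B)^T >= n!, and at least n/B."""
    target = math.factorial(n)
    leaves = math.factorial(B) ** (n // B)
    p = math.comb(M, B)
    T = 0
    while leaves < target:
        leaves *= p
        T += 1
    return max(T, n // B)

n, M, B = 1024, 64, 8               # example parameters chosen for illustration
print(decision_tree_lower_bound(n, M, B), merge_sort_io(n, M, B))
```

Exact integer arithmetic is used throughout so that no floating-point rounding affects the phase count or the comparison with $n!$.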



Figure 11.16 The state of the block merging algorithm after merging four blocks. The algorithm
merges $M/B$ sublists, each containing $L$ blocks of $B$ keys. (Diagram labels: secondary memory,
primary memory, secondary memory; block size $B$.)

Similar results can be obtained for the permutation networks defined in Section 7.8.2 (see
Problem 11.18), the FFT defined in Section 6.7.3 (see Problem 11.19), and matrix transposition
defined in Section 6.5.4 (see [9]).

11.9 The Hierarchical Memory Model 

In this section we define the hierarchical memory model and derive bounds on the time to do 
matrix multiplication, the FFT and binary search in this model. These results provide another 
opportunity to evaluate the performance of memory hierarchies, this time with a single cost 
function applied to memory accesses at all levels of a hierarchy. We make use of lower bounds 
derived earlier in this chapter. 

DEFINITION 11.9.1 The hierarchical memory model (HMM) is a serial computer in which a
CPU without registers is attached to a random-access memory of unlimited size for which the time
to access location $a$ for reading or writing is the value of a monotone nondecreasing cost function
$\nu(a) : \mathbb{N} \mapsto \mathbb{N}$ from the integers $\mathbb{N} = \{0, 1, 2, 3, \ldots\}$ to $\mathbb{N}$. The cost of computing
$f^{(n)} : \mathcal{B}^n \mapsto \mathcal{B}^m$ with the HMM using the cost function $\nu(a)$, $\mathcal{K}_\nu(f)$, is defined as

$$\mathcal{K}_\nu(f) = \max_x \sum_{j=1}^{T(x)} \nu(a_j)$$

where $a_j$, $1 \le j \le T(x)$, is the address accessed by the CPU on the $j$th computational step and
$T(x)$ is the number of steps when the input is $x$.
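
To make the definition concrete, here is a minimal Python sketch that evaluates the cost of one address trace, that is, the inner sum for a single input $x$; the maximization over inputs is omitted. The names `hmm_cost` and `log_cost`, the sample trace, and the convention that addresses below 1 cost nothing under the logarithmic cost are assumptions made for this illustration.

```python
from typing import Callable, Iterable

def hmm_cost(trace: Iterable[int], nu: Callable[[int], int]) -> int:
    """Sum nu(a_j) over the addresses a_j accessed in one computation."""
    return sum(nu(a) for a in trace)

def log_cost(a: int) -> int:
    """Logarithmic cost ceil(log2 a) for a >= 1 (taken as 0 below 1 by assumption)."""
    return (a - 1).bit_length() if a >= 1 else 0

trace = [1, 2, 3, 8, 9, 2]                    # hypothetical address sequence
unit = hmm_cost(trace, lambda a: 1)           # unit cost: the number of steps
logarithmic = hmm_cost(trace, log_cost)       # cost under ceil(log2 a)
print(unit, logarithmic)
```

With the unit cost function the result is simply the number of steps, which matches the observation that the HMM with unit cost is the standard random-access machine.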

The HMM with cost function $\nu(a) = 1$ is the standard random-access machine described
in Section 3.4. While in principle the HMM can model many of the details of the MHG, it
is more difficult to make explicit the dependence of $\nu(a)$ on the amount of memory at each
level in the hierarchy as well as the time for a memory access in seconds at that level. Even
though the HMM can model programs with branching and looping, following [7] we assume
straight-line programs when studying the FFT and matrix-matrix multiplication problems with
this model.

Let $n(f, x, a)$ be the number of times that address $a$ is accessed in the HMM for $f$ on
input $x$. It follows that the cost $\mathcal{K}_\nu(f)$ can be expressed as follows:

$$\mathcal{K}_\nu(f) = \max_x \sum_{1 \le a} n(f, x, a)\, \nu(a) \qquad (11.3)$$

Many cost functions have been studied in the HMM, including $\nu(a) = \lceil \log_2 a \rceil$, $\nu(a) =
a^\alpha$, and $\nu(a) = U_m(a)$, where $U_m(a)$ is the following threshold function with threshold $m$:

$$U_m(a) = \begin{cases} 1 & a > m \\ 0 & \text{otherwise} \end{cases}$$

It follows that

$$\mathcal{K}_{U_m}(f) = \max_x \sum_{a > m} n(f, x, a)$$

For the matrix-matrix multiplication and FFT problems, the cost $\mathcal{K}_{U_m}(f)$ of computing $f$ is
directly related to the number of I/O operations with the red-blue pebble game played with
$S = m$ red pebbles discussed in Sections 11.5.2 and 11.5.3. For this reason we call this cost
I/O complexity. The principal difference is that in the HMM no cost is assessed for data
stored in the first $m$ memory locations.

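Let the differential cost function $\Delta\nu(a)$ be defined as

$$\Delta\nu(a) = \nu(a) - \nu(a - 1)$$

As a consequence, we can write $\nu(a)$ as follows if we set $\nu(-1) = 0$:

$$\nu(a) = \sum_{0 \le b \le a} \Delta\nu(b) \qquad (11.4)$$

Since $\nu(a)$ is a monotone nondecreasing function, $\Delta\nu(m)$ is nonnegative.
Rewriting (11.3) using (11.4), we have

$$\begin{aligned}
\mathcal{K}_\nu(f) &= \max_x \sum_{1 \le a} n(f, x, a) \sum_{0 \le b \le a} \Delta\nu(b) \\
&= \max_x \sum_{c=0}^{\infty} \Delta\nu(c) \sum_{d=c}^{\infty} n(f, x, d) \\
&\le \sum_{c=0}^{\infty} \Delta\nu(c) \max_x \sum_{d=c}^{\infty} n(f, x, d) \\
&= \sum_{c=0}^{\infty} \Delta\nu(c)\, \mathcal{K}_{U_{c-1}}(f) \qquad (11.5)
\end{aligned}$$

For straight-line programs the access counts $n(f, x, a)$ do not depend on the input $x$, so the
third line holds with equality.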
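
The rearrangement behind (11.5) can be checked on a single access pattern. The sketch below is an illustration only: it assumes the threshold function charges one unit for each access to an address $a > m$, sets $\nu(-1) = 0$ as in the text, and uses hypothetical helper names (`direct_cost`, `threshold_cost`, `decomposed_cost`) and a made-up trace.

```python
from collections import Counter

def log_cost(a: int) -> int:
    """ceil(log2 a) for a >= 1; taken to be 0 for a < 1 (an assumption)."""
    return (a - 1).bit_length() if a >= 1 else 0

def direct_cost(counts, nu):
    """Sum of n(a) * nu(a) over all accessed addresses a."""
    return sum(n * nu(a) for a, n in counts.items())

def threshold_cost(counts, m):
    """Cost under U_m: one unit per access to an address a > m."""
    return sum(n for a, n in counts.items() if a > m)

def decomposed_cost(counts, nu):
    """Sum over c of delta_nu(c) times the number of accesses to addresses >= c."""
    total = 0
    for c in range(0, max(counts) + 1):
        prev = nu(c - 1) if c > 0 else 0      # convention nu(-1) = 0
        total += (nu(c) - prev) * threshold_cost(counts, c - 1)
    return total

counts = Counter([1, 2, 3, 8, 9, 2])          # hypothetical access counts n(a)
print(direct_cost(counts, log_cost), decomposed_cost(counts, log_cost))
```

Both expressions give the same value for any nondecreasing cost function, because the differences $\Delta\nu(c)$ telescope back to $\nu(a)$.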



11.9.1 Lower Bounds for the HMM 

Before deriving bounds on the cost to do a variety of tasks in the HMM, we introduce the 
binary search problem. 

A binary tree is a tree in which each vertex has either one or two descendants except leaf
vertices, which have none. (See Fig. 11.17.) Also, every vertex except the root vertex has one
parent vertex. The length of a path in a tree is the number of edges on that path. The
left (right) subtree of a vertex is the subtree that is detached by removing the left (right)
descending edge. A binary search tree is a binary tree that has one key at each vertex. (This
definition assumes that all the keys in the tree are distinct.) The value of this one key is larger
than that of all keys in the left subtree, if any, and smaller than all keys in the right subtree, if
any. A balanced binary search tree is a binary search tree in which all paths have length $k$ or
$k + 1$ for some integer $k$.

Figure 11.17 A binary search tree.

LEMMA 11.9.1 The length of the longest path in a binary tree with $n$ vertices is at least
$\lceil \log_2((n + 1)/2) \rceil$.

Proof A longest path in a binary tree with $n$ vertices is smallest when all levels in the tree
are full except possibly for the bottom level. If such a tree has a longest path of length $l$, it
has between $2^l$ and $2^{l+1} - 1$ vertices. It follows that the longest path in a binary search tree
containing $n$ keys is at least $\lceil \log_2((n + 1)/2) \rceil$. ■

The binary search procedure searches a binary search tree for a key value $v$. It compares
$v$ against the root value, stopping if they are equal. If they are not equal and $v$ is less than the
key at the root, the search resumes at the root vertex of the left subtree. Otherwise, it resumes
at the root of the right subtree. The procedure also stops when a leaf vertex is reached.
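
The procedure just described can be sketched in Python as follows. This is a minimal illustration, not code from the text: the `Node` class, `build_balanced`, `height`, and `search` are names chosen here, and the tree is built from a sorted list of distinct keys by always placing the middle key at the root.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Node:
    key: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def build_balanced(keys: List[int]) -> Optional[Node]:
    """Build a balanced binary search tree from a sorted list of distinct keys."""
    if not keys:
        return None
    mid = len(keys) // 2
    return Node(keys[mid], build_balanced(keys[:mid]), build_balanced(keys[mid + 1:]))

def height(node: Optional[Node]) -> int:
    """Number of edges on the longest root-to-leaf path (-1 for an empty tree)."""
    if node is None:
        return -1
    return 1 + max(height(node.left), height(node.right))

def search(root: Optional[Node], v: int) -> Tuple[bool, int]:
    """Return (found, number of vertices whose keys were compared with v)."""
    node, visited = root, 0
    while node is not None:
        visited += 1
        if v == node.key:
            return True, visited
        node = node.left if v < node.key else node.right
    return False, visited

root = build_balanced(list(range(1, 16)))     # 15 keys, so three edges on every full path
print(height(root), search(root, 1), search(root, 16))
```

For a tree built this way the longest path has $\lceil \log_2((n+1)/2) \rceil$ edges, so it meets the bound of Lemma 11.9.1 with equality.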

We can now state bounds on the cost on the HMM for the logarithmic cost function
$\nu(a) = \lceil \log_2 a \rceil$. This function applies when the memory hierarchy is organized as a binary
tree in which the low-indexed memory locations are located closest to the root and the time
to retrieve an item is proportional to the number of edges between it and the root. We use it
to illustrate the techniques developed in the previous section.

Theorem 11.9.1 states lower performance bounds for straight-line algorithms. Thus, the 
computation time is independent of the particular argument of the function / provided as 
input. Matching upper bounds are derived in the following section. (The logarithmic cost 
function is polynomially bounded.) 

THEOREM 11.9.1 The cost function $\nu(a) = \lceil \log_2 a \rceil$ on the HMM for the $n \times n$ matrix
multiplication function $f^{(n)}_{A \times B}$ realized by the classical algorithm, the $n$-point FFT associated with
the graph $F^{(d)}$, $n = 2^d$, comparison-based sorting on $n$ keys $f^{(n)}_{\mathrm{sort}}$, and binary search on $n$ keys,
$f^{(n)}_{\mathrm{BS}}$, satisfies the following lower bounds:

Matrix multiplication: $\mathcal{K}_\nu(f^{(n)}_{A \times B}) = \Omega(n^3)$

Fast Fourier transform: $\mathcal{K}_\nu(F^{(d)}) = \Omega(n \log n \log\log n)$

Comparison-based sorting: $\mathcal{K}_\nu(f^{(n)}_{\mathrm{sort}}) = \Omega(n \log n \log\log n)$

Binary search: $\mathcal{K}_\nu(f^{(n)}_{\mathrm{BS}}) = \Omega(\log^2 n)$

Proof The lower bounds for the logarithmic cost function $\nu(a) = \lceil \log_2 a \rceil$ use the fact
that $\Delta\nu(a) = 1$ when $a = 2^k + 1$ for some integer $k \ge 0$ but is otherwise 0. It follows from (11.5)
that

$$\mathcal{K}_\nu(f) = \sum_{k=0}^{t} \mathcal{K}_{U_{2^k}}(f) \qquad (11.6)$$

for the task characterized by $f$, where $t$ satisfies $2^t \le N$ and $N$ is the space used by the task.
$N = 2n^2$ for $n \times n$ matrix multiplication, $N = n$ for the FFT graph $F^{(d)}$, and $N = n$ for
binary search.

In Theorem 11.5.3 it was shown that the number of I/O operations to perform $n \times n$
matrix multiplication with the classical algorithm is $\Omega(n^3/\sqrt{m})$. The model of this theorem
assumes that none of the inputs are in the primary memory, the equivalent of the first $m$
memory locations in the HMM.

Since no charge is assessed by the $U_m(a)$ cost function for data in the first $m$ memory
locations, a lower bound on cost with this measure can be obtained from the lower bound
obtained with the red-blue pebble game by subtracting $m$ to take into account the first $m$
I/O operations that need not be performed.

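Thus for matrix multiplication, $\mathcal{K}_{U_m}(f^{(n)}_{A \times B}) = \Omega((n^3/\sqrt{m}) - m)$. Since

$$(n^3/\sqrt{m}) - m \ge (\sqrt{8} - 1)\, n^3/\sqrt{8m}$$

when $m \le n^2/2$, it follows from (11.6) that $\mathcal{K}_\nu(f^{(n)}_{A \times B}) = \Omega(n^3)$ because $\sum_{k=0}^{t} n^3/2^{k/2} =
\Omega(n^3)$.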

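For the same reason, $\mathcal{K}_{U_m}(F^{(d)}) = \Omega((n \log n)/\log m - m)$ (see Theorem 11.5.5)
and $(n \log n/\log m) - m \ge n \log n/(2 \log m)$ for $m \le n/2$. It follows that $\mathcal{K}_\nu(F^{(d)})$
satisfies

$$\mathcal{K}_\nu(F^{(d)}) = \Omega\left(\sum_{k=1}^{\log n - 1} \frac{n \log n}{\log(2^k)}\right) = \Omega\left(n \log n \sum_{k=1}^{\log n - 1} \frac{1}{k}\right) = \Omega(n \log n \log\log n)$$

The last equation follows from the observation that $\sum_{k=1}^{p} 1/k$ is closely approximated by
$\int_1^p \frac{1}{x}\, dx$, which is $\ln p$. (See Problem 11.2.)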

The lower bound for comparison-based sorting uses the $\Omega(n \log n/\log m)$ sorting lower
bound for the BTM with a block size $B = 1$. Since the BTM assumes that no data are resident
in the primary memory before a computation begins, the lower bound for the HMM
cost under the $U_m$ cost function is $\Omega((n \log n/\log m) - m)$. Thus, the FFT lower bound
applies in this case as well.

Finally, we show that the lower bound for binary search is $\mathcal{K}_{U_m}(f^{(n)}_{\mathrm{BS}}) = \Omega(\log n -
\log m)$. Each path in the balanced binary search tree has length $d = \lceil \log_2((n + 1)/2) \rceil$ or
$d - 1$. Choose a query path that visits the minimum number of variables located in the first
$m$ memory locations. To make this minimum number as large as possible, place the items
in the first $m$ memory locations as close to the root as possible. They will form a balanced
binary subtree of path length $l = \lceil \log_2((m + 1)/2) \rceil$ or $l - 1$. Thus no full path will have
more than $l$ edges and $l - 1$ variables from the first $m$ memory locations. It follows that
there is a path containing at least $d - 1 - (l - 1) = d - l = \lceil \log_2(n + 1) \rceil - \lceil \log_2(m + 1) \rceil$
variables that are not in the first $m$ memory locations. At least one I/O operation is needed
per variable to operate on them. It thus follows that

$$\mathcal{K}_\nu(f^{(n)}_{\mathrm{BS}}) = \sum_{d=0}^{\log n} \Omega(\log n - \log(2^d)) = \sum_{d=0}^{\log n} \Omega(\log n - d) = \Omega(\log^2 n)$$

The last step is a consequence of the fact that $\log n - d$ is at least $(\log n)/2$ for
$d \le (\log n)/2$. ■
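
Two sums carry the asymptotics in this proof: the harmonic sum $\sum_{k=1}^{p} 1/k$, which behaves like $\ln p$, and the sum of $\log n - d$ over $d = 0, \ldots, \log n$, which grows quadratically in $\log n$. The following Python sketch, with function names of our own choosing, tabulates both as a sanity check rather than as a proof.

```python
import math

def harmonic(p: int) -> float:
    """Sum of 1/k for k = 1..p."""
    return sum(1.0 / k for k in range(1, p + 1))

def binary_search_sum(log_n: int) -> int:
    """Sum of (log n - d) for d = 0..log n, the shape of the binary-search bound."""
    return sum(log_n - d for d in range(log_n + 1))

for p in (10, 100, 1000):
    # The harmonic sum stays within an additive constant of ln p.
    print(p, round(harmonic(p), 3), round(math.log(p), 3))

for log_n in (8, 16, 32):
    # The second column equals log_n * (log_n + 1) / 2.
    print(log_n, binary_search_sum(log_n), log_n * log_n)
```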

Lower bounds on the I/O complexity for these problems can be derived for a large variety
of cost functions. The reader is asked in Problem 11.20 to derive such bounds for the cost
function $\nu(a) = a^\alpha$.

11.9.2 Upper Bounds for the HMM 

A natural question in this context is whether these lower bounds can be achieved. We already
know from Theorems 11.5.3 and 11.5.5 that for each allocation of memory to each
memory-hierarchy level, it is possible to match upper and lower bounds on the number of I/O
operations and computation time. As a consequence, near-optimal solutions exist for each of
these problems for any cost function on memory accesses.

11.10 Competitive Memory Management 

The results stated above for the hierarchical memory model assume that the user has explicit 
control over the location of data, an assumption that does not apply if storage is allocated by an 
operating system. In this section we examine memory management by an operating system 
for the HMM model, that is, algorithms that respond to memory requests from programs to 
move stored items (instructions and data) up and down the memory hierarchy. We examine 
offline and online memory management algorithms. An offline algorithm is one that has 
complete knowledge of the future. Online algorithms cannot predict the future and must act 
only on the data received up to the present time. 

We use competitive analysis, a type of analysis not appearing elsewhere in this book, to 
show that the two widely used online page-replacement algorithms, least recently used (LRU) 
and first-in, first-out (FIFO), use about twice as many I/O operations as does MIN, the opti- 
mal offline page-replacement algorithm, when these two algorithms are allowed to use about 
twice as much memory as MIN. Competitive analysis bounds the performance of an online 
algorithm in terms of that of the optimum offline algorithm for the problem without knowing 
the performance of the optimum algorithm. 

Virtual memory-management systems allow the programmer to program for one large
virtual random-access memory, such as that assumed by the HMM, although in reality the
memory contains multiple physical memory units one of which is a fast random-access unit
accessed by the CPU. In such systems the hardware and operating system cooperate to move
data from secondary storage units to the primary storage unit in pages (a collection of items).
Each reference to a virtual memory location is checked to determine whether or not the referenced
item is in primary memory. If so, the virtual address is converted to a physical one and
the item fetched by the CPU. If not (if a page fault occurs), the page containing the virtual
address is moved into primary memory and the tables used to translate virtual addresses are
updated. The item at the virtual address is then fetched. To make room for the newly fetched
page, one page in the fast memory is moved up the memory hierarchy.

A page-replacement algorithm is an algorithm that decides which page to remove from a 
full primary memory to make space for a new page. We describe and analyze page-replacement 
algorithms for two-level memory hierarchies both because they are important in their own right 
and because they are used as building blocks for multi-level page-replacement algorithms. A 
two-level hierarchy has primary and secondary memories. Let the primary memory contain n 
pages and let the secondary memory be of unlimited size. 

The FIFO (first-in, first-out) page-replacement algorithm is widely used because it is sim- 
ple to implement. Under this replacement policy, the page replaced is the first page to have 
arrived in primary memory. The LRU (least recently used) replacement algorithm requires 
keeping for each page the time it was last accessed and then choosing for replacement the page 
with the earliest time, an operation that is more expensive to implement than the FIFO shift 
register. 

Under the optimal two-level page-replacement algorithm, called MIN, primary memory
is initialized with the first $n$ pages to be accessed. MIN replaces the page $p_i$ in primary memory
whose time $t_i$ of next access is largest. If some other page, $p_j$, were replaced instead of $p_i$, $p_j$
would have to return to the primary memory before $p_i$ is next accessed, and one more page
replacement would occur than is required by MIN.

Implementing MIN requires knowledge of the future, a completely unreasonable assump- 
tion on the part of the operating system designer. Nonetheless, MIN is very useful as a standard 
against which to compare the performance of other page-replacement algorithms such as FIFO 
and LRU. 

11.10.1 Two-Level Memory-Management Algorithms 

To compare the performance of FIFO, LRU, and MIN, we characterize memory use by a
memory-address sequence $s = \{s_1, s_2, \ldots\}$ of HMM addresses accessed by a computation.
We assume that no memory entries are created or destroyed. We let $F_{\mathrm{FIFO}}(n, s)$, $F_{\mathrm{LRU}}(n, s)$,
and $F_{\mathrm{MIN}}(n, s)$ be the number of page faults with each page-replacement algorithm on the
memory address sequence $s$ when the primary memory holds $n$ pages.
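
The three fault counts can be obtained by direct simulation. The Python sketch below is a simplified illustration under assumptions of our own: each element of `s` is treated as a page identifier, every algorithm starts with an empty primary memory (so the first references count as faults, unlike the pre-initialized memory described for MIN above), and the function names are hypothetical.

```python
from collections import OrderedDict, deque

def fifo_faults(n, s):
    """Page faults of FIFO with n page frames on the reference sequence s."""
    mem, order, faults = set(), deque(), 0
    for page in s:
        if page in mem:
            continue
        faults += 1
        if len(mem) == n:
            mem.discard(order.popleft())      # evict the earliest arrival
        mem.add(page)
        order.append(page)
    return faults

def lru_faults(n, s):
    """Page faults of LRU with n page frames."""
    mem, faults = OrderedDict(), 0
    for page in s:
        if page in mem:
            mem.move_to_end(page)             # mark as most recently used
            continue
        faults += 1
        if len(mem) == n:
            mem.popitem(last=False)           # evict the least recently used
        mem[page] = True
    return faults

def min_faults(n, s):
    """Page faults of MIN: evict the page whose next access is furthest away."""
    mem, faults = set(), 0
    for i, page in enumerate(s):
        if page in mem:
            continue
        faults += 1
        if len(mem) == n:
            def next_use(p):
                for j in range(i + 1, len(s)):
                    if s[j] == p:
                        return j
                return len(s)                 # never used again
            mem.discard(max(mem, key=next_use))
        mem.add(page)
    return faults

s = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]      # a standard textbook reference string
print(fifo_faults(3, s), lru_faults(3, s), min_faults(3, s))
```

The MIN simulation rescans the remainder of the sequence on every fault, which is quadratic in the worst case but keeps the sketch short.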

We now bound the performance of the FIFO and LRU page-replacement algorithms in 
terms of that of MIN. We show that if the number of pages available to FIFO and LRU 
is double the number available to MIN, the number of page faults with FIFO and LRU is 
at most about double the number with MIN. It follows that FIFO and LRU are very good 
page-replacement algorithms, a result seen in practice. 

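THEOREM 11.10.1 Let $n_{\mathrm{FIFO}}$, $n_{\mathrm{LRU}}$, and $n_{\mathrm{MIN}}$ be the number of primary memory pages used
by the FIFO, LRU, and MIN algorithms. Let $n_{\mathrm{FIFO}} \ge n_{\mathrm{MIN}}$ and $n_{\mathrm{LRU}} \ge n_{\mathrm{MIN}}$. Then, for
any memory-address sequence $s$ the following inequalities hold:

$$F_{\mathrm{FIFO}}(n_{\mathrm{FIFO}}, s) \le \frac{n_{\mathrm{FIFO}}}{n_{\mathrm{FIFO}} - n_{\mathrm{MIN}} + 1}\, F_{\mathrm{MIN}}(n_{\mathrm{MIN}}, s) + n_{\mathrm{MIN}}$$

$$F_{\mathrm{LRU}}(n_{\mathrm{LRU}}, s) \le \frac{n_{\mathrm{LRU}}}{n_{\mathrm{LRU}} - n_{\mathrm{MIN}} + 1}\, F_{\mathrm{MIN}}(n_{\mathrm{MIN}}, s) + n_{\mathrm{MIN}}$$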

Proof We establish the result for FIFO, leaving it to the reader to show it for LRU. (See
Problem 11.23.) Consider a contiguous subsequence $t$ of $s$ that immediately follows a page
fault under FIFO and during which FIFO makes $\phi^{\mathrm{FIFO}} = f \le n_{\mathrm{FIFO}}$ page faults. In the
next paragraph we show that at least $f$ different pages are accessed by FIFO during $t$. Let
MIN make $\phi^{\mathrm{MIN}}$ faults during $t$. Because MIN has $n_{\mathrm{MIN}}$ pages, $\phi^{\mathrm{MIN}} \ge f - n_{\mathrm{MIN}} + 1 \ge
0$. Thus, the ratio of page faults by FIFO and MIN is $f/\phi^{\mathrm{MIN}} \le f/(f - n_{\mathrm{MIN}} + 1)$.

Let $p_i$ be the page on which the fault occurs just before the start of $t$. To show that at
least $f$ different pages are accessed by FIFO during $t$, consider the following cases: a) FIFO
faults on $p_i$ in $t$; b) FIFO faults on some other page at least twice in $t$; and c) neither case
applies. In the first case, FIFO accesses at least $n_{\mathrm{FIFO}}$ different pages because if it accessed
fewer, then $p_i$ would still be in its primary memory the second time it is accessed. In the
second case, the same statement applies to the page accessed multiple times. In the third
case, FIFO can have only $f$ faults if it accesses at least $f$ different pages during $t$.

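Now subdivide the memory access sequence $s$ into subsequences $t_0, t_1, \ldots, t_k$ such that
$t_i$, $i \ge 1$, starts immediately after a page fault under FIFO and contains $n_{\mathrm{FIFO}}$ faults and
$t_0$ contains at most $n_{\mathrm{FIFO}}$ page faults. This set of subsequences can be found by scanning $s$
backwards. Since MIN makes $\phi_j^{\mathrm{MIN}} \ge n_{\mathrm{FIFO}} - n_{\mathrm{MIN}} + 1$ faults on the $j$th interval, $j \ge 1$,
and $\phi_0^{\mathrm{MIN}} \ge \phi_0^{\mathrm{FIFO}} - n_{\mathrm{MIN}}$ faults on the zeroth interval (that is, $\phi_0^{\mathrm{FIFO}} \le \phi_0^{\mathrm{MIN}} + n_{\mathrm{MIN}}$),
the number of faults by FIFO, $F_{\mathrm{FIFO}}(n_{\mathrm{FIFO}}, s) = \phi_0^{\mathrm{FIFO}} + \phi_1^{\mathrm{FIFO}} + \cdots + \phi_k^{\mathrm{FIFO}}$, satisfies
the condition of the theorem because $\phi_j^{\mathrm{FIFO}} \le n_{\mathrm{FIFO}}\, \phi_j^{\mathrm{MIN}}/(n_{\mathrm{FIFO}} - n_{\mathrm{MIN}} + 1)$ for
$j \ge 1$. ■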

The upper bounds are almost best possible because, as stated in Problem 11.24, for any
online algorithm $A$ there is a memory-access sequence such that the number of page faults
$F_A(n_A, s)$ satisfies the following lower bound:

$$F_A(n_A, s) \ge \frac{n_A}{n_A - n_{\mathrm{MIN}} + 1}\, F_{\mathrm{MIN}}(n_{\mathrm{MIN}}, s)$$

The difference between this lower bound and the upper bounds given for FIFO and LRU
is $n_{\mathrm{MIN}}$, which takes into account the possibility that the initial entries in the primary
memory of MIN and FIFO can be completely different.

It follows that the FIFO and LRU page-replacement strategies are very effective strategies 
for two-level memory hierarchies. 



Problems 

MATHEMATICAL PRELIMINARIES 

11.1 Let $a$ and $b$ be integers satisfying $1 \le a \le b$. Show that $b/2 \le a \lfloor b/a \rfloor \le b$.
Hint: Consider values of $b$ in the range $ka \le b < (k + 1)a$ for $k$ an integer.

11.2 Derive a good lower bound on $\sum_{k=1}^{m} (1/k)$ of the form $\Omega(\log m)$ using an approach
similar to that of Problem 2.2.

PEBBLING MODELS

11.3 Show that the graph of Fig. 11.2 can be completely pebbled in the three-level MHG
with resource vector $p = (2, 4)$ using only four third-level pebbles.

11.4 Consider pebbling a graph with the red-blue game. Suppose that each I/O operation
uses twice as much time as a computation step. Show by example that a red-blue
pebbling minimizing the total time to pebble a graph does not always minimize the
number of I/O operations.

I/O TIME RELATIONSHIPS 

11.5 Let $S_{\min}$ be the minimum number of pebbles needed to pebble the graph $G = (V, E)$
in the red pebble game. Show that if in the MHG a pebbling strategy $\mathcal{P}$ uses $S_k$ pebbles
at level $k$ or less and $S_k \ge S_{\min} + k - 1$, then no I/O operations at level $k + 1$ or
higher are necessary except on input and output vertices of $G$.

11.6 The rules of the red-blue pebble game suggest that inputs should be prefetched from 
high-level memory units early enough that they arrive when needed. Devise a schedule 
for delivering inputs so that the number of I/O operations for matrix multiplication is 
minimized in the red-blue pebble game. 

THE HONG-KUNG LOWER-BOUND METHOD 

11.7 Derive an expression for the $S$-span $\rho(S, G)$ of the binary tree $G$ shown in Fig. 11.4.

11.8 Consider the pyramid graph $G$ on $n$ inputs shown in Fig. 11.18. Determine its $S$-span
$\rho(S, G)$ as a function of $S$.

11.9 In Problem 2.3 it is shown that every binary tree with k leaves has k — 1 internal vertices. 
Show that if t binary trees have a total of p pebbles, at most p — 1 pebbling steps are 
possible on these trees from an arbitrary initial placement without re-pebbling inputs. 
Hint: The vertices that can be pebbled from an initial placement of pebbles form a set 
of binary trees. 

11.10 An I/O operation is simple if after a pebble is placed on a vertex the pebble currently 
residing on that vertex is removed. Show that at most twice as many I/O operations are 
used at each level by the MHG when every I/O operation is simple. 




Hint: Compare pebble placement with and without the requirement that placements
be simple, arguing that if a pebble removed by a simple I/O operation is needed later it
can be obtained by one simple I/O operation for each of the original I/O operations.

Figure 11.18 The pyramid graph.

TRADEOFFS IN THE MEMORY HIERARCHIES 

11.11 Using the results of Problem 11.8, derive good upper and lower bounds on the I/O 
time to pebble the pyramid graph of Fig. 11.18 in terms of n. 

11.12 Under the conditions of Problem 11.4, show that any pebbling of a DAG for convolution
of $n$-sequences with the minimal pebbling strategy when $S \ge S_{\min}$ and $n$ is large
has much larger total cost than a strategy that treats blue pebbles as red pebbles.

BLOCK I/O IN THE MHG 

11.13 Determine how efficiently matrix-vector multiplication can be done in the block-I/O
model described in Section 11.6.

11.14 Show that matrix-matrix multiplication can be done efficiently in the block-I/O model 
described in Section 11.6. 

SIMULATING FAST MEMORIES 

11.15 Determine conditions on a memory hierarchy under which the FFT can be executed 
efficiently in the standard MHG. Discuss the extent to which these conditions are likely 
to be met in practice. 

11.16 Repeat the previous problem for convolution realized by the algorithm stated in the 
convolution theorem. 

11.17 The definition of a minimal pebbling stated in Section 11.2 assumes that it is much 
more expensive to perform a high-level I/O operation than a low-level one. Determine 
the extent to which the lower bound of Theorem 11.4.1 depends on this assumption. 
Apply your insight to the problem of matrix multiplication of $n \times n$ matrices in the
three-level MHG in which $s_1 < 3n^2$ and $s_2 > 3n^2$. (See Theorem 11.5.3.) Determine
whether increasing the number of level-3 I/O operations affects the number of level-2 
I/O operations. 

THE BLOCK-TRANSFER MODEL 

11.18 Derive a lower bound on the time to realize a permutation network on n inputs in the 
block-transfer model. 

Hint: Count the number of orderings possible between the n inputs. Base your argu- 
ment on the number of orderings within blocks and between elements in the primary 
memory, and the number of ways of choosing which block from the secondary memory 
to move into the primary memory. 

11.19 Derive a lower bound on the time to realize the FFT graph on n inputs in the block- 
transfer model. 

Hint: Use the result of Section 7.8.2 to argue that an $n$-point FFT graph cannot have
many fewer vertices than there are switches in a permutation network.

THE HIERARCHICAL MEMORY MODEL

11.20 Derive the following lower bounds on the cost of computing the following functions
when the cost function is $\nu(a) = a^\alpha$:

Matrix multiplication: $\mathcal{K}_\nu(f^{(n)}_{A \times B}) = \begin{cases} \Omega(n^{2\alpha + 2}) & \text{if } \alpha > 1/2 \\ \Omega(n^3 \log n) & \text{if } \alpha = 1/2 \\ \Omega(n^3) & \text{if } \alpha < 1/2 \end{cases}$

Fast Fourier transform: $\mathcal{K}_\nu(F^{(d)}) = \Omega(n^{\alpha + 1})$

Binary search: $\mathcal{K}_\nu(f^{(n)}_{\mathrm{BS}}) = \Omega(n^\alpha)$

Hint: Use the following identity to recast expressions for the computation time:

$$\sum_{k=1}^{n} \Delta g(k)\, h(k) = -\sum_{k=1}^{n-1} \Delta h(k)\, g(k + 1) + g(n + 1)h(n) - g(1)h(1)$$

11.21 A cost function $\nu(a)$ is polynomially bounded if for some $K > 1$ and all $a \ge 1$,
$\nu(2a) \le K\nu(a)$. Let the cost function $\nu(a)$ be polynomially bounded. Show that
there are positive constants $c$ and $d$ such that $\nu(a) \le c a^d$.

11.22 Derive a good upper bound on the cost to sort in the HMM with the logarithmic cost
function $\lceil \log a \rceil$.

COMPETITIVE MEMORY MANAGEMENT 

11.23 By analogy with the proof for FIFO in the proof of Theorem 11.10.1, consider any
memory-address sequence $s$ and a contiguous subsequence $t$ of $s$ that immediately
follows a page fault under LRU and during which LRU makes $\phi^{\mathrm{LRU}} = f \le n_{\mathrm{LRU}}$
page faults. Show that at least $f$ different pages are accessed by LRU during $t$.

11.24 Let $A$ be any online page-replacement algorithm that uses $n_A$ pages of primary memory.
Show that there are arbitrarily long memory-address sequences $s$ such that the number
of page faults with $A$, $F_A(s)$, satisfies the following lower bound, where $n_{\mathrm{MIN}}$ is the
number of pages used by the optimal algorithm MIN:

$$F_A(s) \ge \frac{n_A}{n_A - n_{\mathrm{MIN}} + 1}\, F_{\mathrm{MIN}}(s)$$

Hint: Design a memory-address sequence $s$ of length $n_A$ with the property that the
first $n_A - n_{\mathrm{MIN}} + 1$ accesses by $A$ are to pages that are in neither $A$'s nor MIN's primary
memory. Let $S$ be the $n_A + 1$ pages that are either in MIN's primary memory initially
or those accessed by $A$ during the first $n_A - n_{\mathrm{MIN}} + 1$ accesses. Let the next $n_{\mathrm{MIN}} - 1$
page accesses by $A$ be to pages not in $S$.

Chapter Notes

Hong and Kung [137] introduced the first formal model for the I/O complexity of problems,
the red-blue pebble game, an extension of the pebble game introduced by Paterson and Hewitt
[239]. The analysis of Section 11.1.2 is due to Kung [178]. Hong and Kung derived lower
bounds on the number of I/O operations needed for specific graphs for matrix multiplication
(Theorem 11.5.2), the FFT (Theorem 11.5.4), odd-even transposition sort and a number of
other problems. Savage [295] generalized the red-blue pebble game to the memory-hierarchy
game, simplified the proof of Theorem 11.4.1, and obtained Theorems 11.5.3 and 11.5.5 and
the results of Section 11.3. Lemma 11.5.2 is implicit in the work of Hong and Kung [137];
the simplified proof given here is due to Aggarwal and Vitter [9]. The results of Section 11.5.4
are due to Savage [295].

The two-level contiguous block-transfer model of Section 11.8.1 was introduced by Savage
and Vitter [296] in the context of parallel space-time tradeoffs. The analysis of sorting of
Section 11.8.1 is due to Aggarwal and Vitter [9]. In this paper they also derive similar bounds
on the I/O time to realize the FFT, permutation networks and matrix transposition.

The hierarchical memory model of Section 11.9 was introduced by Aggarwal, Alpern,
Chandra, and Snir [7]. They studied a number of problems including matrix multiplication,
the FFT, sorting and circuit simulation, and examined logarithmic, linear, and polynomial
cost functions. The two-level bounds of Section 11.10 are due to Sleator and Tarjan [311].
Aggarwal, Alpern, Chandra, and Snir [7] extended this model to multiple levels. The MIN
page-replacement algorithm described in Section 11.10 is due to Belady [35].

Two other I/O models of interest are the BT model and the uniform memory hierarchy.
Aggarwal, Chandra, and Snir [8] introduced the BT model, an extension of the HMM model
supporting block transfers in which a block of size $b$ ending at location $x$ is allowed to move
in time $f(x) + b$. They establish tight bounds on computation time for problems including
matrix transpose, FFT, and sorting using the cost functions $\lceil \log x \rceil$, $x$, and $x^\alpha$ for $0 < \alpha < 1$.

Alpern, Carter, and Feig [18] introduced the uniform memory hierarchy in which the
$u$th memory has capacity $\alpha\rho^{2u}$, block size $\rho^u$, and time $\rho^u/\beta(u)$ to move a block between
levels; $\beta(u)$ is a bandwidth function. They allow I/O overlap between levels and determine
conditions under which matrix transposition, matrix multiplication, and Fourier transforms
can and cannot be done efficiently.

Vitter and Shriver [354] have examined three parallel memory systems in which the memories
are disks with block transfer, of the HMM type, or of the BT type. They present a
randomized version of distribution sort that meets the lower bounds for these models of
computation. Nodine and Vitter [232] give an optimal deterministic sorting algorithm for these
memory models.
CHAPTER



VLSI Models of Computation 



The electronics revolution initiated by the invention of the transistor by Shockley, Brattain,
and Bardeen in 1947 accelerated with the invention of the integrated circuit in 1958 and 1959 
by Jack Kilby and Robert Noyce. An integrated circuit contains wires, transistors, resistors, 
and other components all integrated on the surface of a chip, a piece of semiconductor material 
about the size of a thumbnail. And the revolution continues. The number of components that 
can be placed on a semiconductor chip has doubled almost every 18 months for about 40 years. 
Today more than 10 million of them can fit on a single chip. Integrated circuits with very large 
numbers of components exhibit what is known as very large-scale integration (VLSI). This 
chapter explores the new models that arise as a result of VLSI. 

As electronic components decreased in size, the area occupied by wires
consumed an increasing fraction of chip area. In fact, today some applications devote more 
than half of their area to wires. In this chapter we examine VLSI models of computation 
that take this fact into account. Using simulation techniques analogous to those employed in 
Chapter 3, we show that the performance of algorithms on VLSI chips can be characterized 
by the product $AT^2$, where $A$ is the chip area and $T$ is the number of steps used by a chip
to compute a function. We relate $AT^2$ to the planar circuit size $C_{p,\Omega}(f)$ of a function $f$, a
measure that plays the role for VLSI chips that circuit size plays for FSMs. The $AT^2$ measure
is the direct analog of the measure $C_{\Omega}(\delta, \lambda)T$ for the finite-state machine that was introduced
in Chapter 3, where $C_{\Omega}(\delta, \lambda)$ is the size of a circuit to simulate the next-state and output
functions of the FSM. We also relate the measure $A^2T$ to $C_{p,\Omega}(f)$.



12.1 The VLSI Challenge



The design of VLSI chips represents an enormous intellectual challenge akin to that of
constructing very large programs. Each involves the assembly of millions of elements:
instructions in the case of software and electronic components in the case of chips. The design and
implementation of VLSI chips is also challenging because it involves many steps and many 
technologies. In this section we provide a brief introduction to this process as preparation 
for the introduction of the VLSI models and algorithms that are the principal topics of this 
chapter. 

12.1.1 Chip Fabrication 

A VLSI chip consists of a number of conducting, insulating, and doped layers that are placed 
on a semiconductor substrate. (A doped layer is created on the surface of the substrate by 
infusing small concentrations of impurities into the semiconductor. This is called doping.) 
The layers are created using masks, templates with open regions through which ionizing radi- 
ation is projected onto the surface of the semiconductor. The radiation changes the chemical 
properties of a previously deposited photosensitive material so that the exposed regions can 
be washed away with a solvent. The material that is now exposed can be doped or removed. 
Doping is used to create transistors and wires. A removal step is used when a metallic layer has 
been previously deposited from which sections are to be removed, leaving wires. A chip may 
have several layers of wires separated by layers of insulating material in addition to the doped 
layers that form transistors and wires. The layout of a NAND gate is shown schematically in 
Fig. 12.1, in which the shadings of rectangles and annotations identify to a chip designer the 
types of materials used to realize the gate.

Figure 12.1 The schematic layout of a NAND gate and its logical symbol.

Geometric design rules specify the amounts of overlap of and separation between metal and
dopant rectangles that are needed to guarantee the desired electrical and electronic properties of 
a VLSI circuit. If wires are too thin, electrons, which move through them at very high speeds, 
can cause excess heating as well as dislodge atoms and create an open circuit (this is called 
metal migration), especially at points at which a wire bends to descend into a well created 
during chip fabrication. Similarly, if wires are too close, an error in registration of masks may 
cause short circuits between wires. Also, since transistors are constructed through the doping 
and overlaying of insulating and conducting materials, if the regions defining a transistor are 
too small, it will not behave as expected. 

The geometric design rules for a particular chip technology can be quite complex. For the 
purpose of analysis they are simplified into a few rules concerning the width and separation 
of rectangles, the amount of area required for contacts between wires on layers separated by 
insulation, and the size of the various rectangular regions that form gates and transistors. As 
suggested by this discussion, a VLSI chip is quasiplanar; that is, its components lie on a few 
layers, which are separated by insulation except where contacts are made between layers. 



12.1.2 Design and Layout 

Many tools and techniques have been developed to address the complexity of chip layout. 
Typically these tools and techniques use abstraction; that is, they decompose a problem into 
successively lower level units of increasing complexity. At each level the number of units in- 
volved in a design is kept small so that the design is comprehensible. 

The design of a VLSI chip begins with the specification of its functionality at the func- 
tional or algorithmic level. Either a function or an algorithm is given as the starting point. 
An algorithm is then produced and translated into a specification at the architectural level. 
At this level a chip is specified in terms of large units such as a CPU, random-access memory, 
bus, floating-point unit, and I/O devices. (The material of Chapters 3 and 4 is relevant at this 
level.) After an architectural specification is produced, design commences at the logical level. 
Here particular methods for realizing architectural units are chosen. For example, an adder 
could be realized either as a ripple or a carry-lookahead adder depending on the stated speed 
and cost objectives. (The material of Chapter 2 applies at this level.) 

At the gate level, the next level in the design process, a technology, such as NMOS or
CMOS, is chosen in which to realize the transistors and wires. This involves specifications of 
widths for wires, the number of layers of metal, and other things. If new transistor layouts are 
used, their physics is often simulated to determine their electrical properties. 

At the next level, the layout level, a gate-level design is translated into physical positions for 
modules, gates, and wires. Often at this level a rough layout is produced manually, after which 
automatic routing and compaction algorithms are invoked to route wires between modules 
and squeeze out the unnecessary area. Space must be reserved on each layout for I/O pads, 
rectangular regions large enough to connect external wires. They serve as ports through which 
data is read and written. Because these wires and pads are very large by comparison with the 
wires on the chip, there is a practical limit on the number of I/O ports on a chip. A port can 
be both an input and an output port. 

Once a layout is complete it is usually simulated logically, that is, at the level of Boolean 
gates. Parts of it may also be simulated electrically, a much more time-consuming process given 
the much lower level of detail that it entails.

After a chip has been fabricated it is then tested. Because the testing process for a complete
chip cannot be exhaustive, due to the number of configurations that are possible, subunits are 
often isolated and tested. Testing circuitry is often built into a chip to simplify the testing 
process. 

Because the design, layout, simulation, and testing of VLSI chips is complex and error 
prone, computer-aided design (CAD) tools have been developed. CAD is a very large subject
beyond the scope of this book. Instead, we limit our attention in this chapter to the perfor- 
mance of VLSI chips. 



12.2 VLSI Physical Models 



Of all the parameters that affect the performance of a VLSI chip, its area is one of the most 
important. Equally important are the width of and separation between wires, both of which 
are directly related to area. Area is important for two reasons. First, a larger area means a chip 
can have more computing elements and do more work. Also, more area means a chip can have 
more I/O ports to facilitate data movement on and off the chip. 

Unfortunately, the area of a chip has a practical limit due to imperfections that occur in the 
chip manufacturing process. A single very small piece of dust or a dislocation in the crystalline 
semiconductor substrate, each of which can be large by comparison with the dimensions of 
components, can destroy a chip. As a consequence, only a small fraction (the yield) of the 
chips resulting from a fabrication process work. The rest must be discarded. 

The yield of a chip is very sensitive to its size. If the number of faults per unit area is 
$F$, with very high probability a fault occurs if the area $A$ of a chip exceeds $1/F$. As $F$ is
reduced by improvements in the manufacturing process, the area of any one chip can increase. 
However, if F is fixed, so is the value of A at which an economical yield is possible. (F has 
not decreased much over time.) To make chip manufacture economical, dozens of chips are 
manufactured together on a circular wafer of 4 to 8 inches in diameter. The wafer is then sliced 
into individual chips. If the die size is chosen correctly, a fixed fraction of the chips on a wafer 
will work. The importance of testing becomes evident in light of these observations. 
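
The sensitivity of yield to area can be made concrete with a small sketch. It assumes a Poisson fault model, which is an assumption introduced here for illustration and is not stated in the text: faults land independently at density $F$ per unit area, so a chip of area $A$ is fault-free with probability $e^{-FA}$. The function name `poisson_yield` and the value chosen for $F$ are hypothetical.

```python
import math


def poisson_yield(fault_density: float, area: float) -> float:
    """Probability that a chip of the given area contains no fault.

    Assumption (not from the text): faults form a Poisson process with
    `fault_density` faults per unit area.
    """
    return math.exp(-fault_density * area)


F = 0.5  # illustrative fault density, in faults per unit area
for multiple in (0.25, 1.0, 4.0):
    A = multiple / F  # chip area expressed as a multiple of 1/F
    print(f"A = {multiple:.2f}/F  yield = {poisson_yield(F, A):.3f}")
```

Under this assumed model the yield at $A = 1/F$ is $e^{-1}$, roughly 37 percent, and it falls off exponentially as the area grows beyond that point.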

Because the area of a chip has a practical upper limit, the width and separation of wires 
determine the number of components that can be placed on a chip. As mentioned above, the 
technology for chip manufacture places a lower limit on these parameters as well as the area of 
chip components. 

To simplify our modeling and analysis, we assume that the minimal width and separation 
of wires is $\lambda$ (the minimum feature size) and that each gate, memory cell, port, and pair
of crossing wires has area $\lambda^2$. There is no great loss in assuming a single number for wire
width and separation and one number for the minimal area of components because in practice
the width and separation of wires of different kinds and the area of components are all small
multiples of common values. The only component for which these assumptions are weak is
the pads for I/O ports, which are generally very much larger than $\lambda^2$. It is important to be
cognizant of this fact in drawing conclusions. 

Since chips are quasiplanar, we assume that each chip has at most $\nu \ge 1$ layers on which
wires can reside but that there is only one layer of gates. Also, since wires are rectangular, it
is impractical for them to meet at angles that are not close to 90 or 45 degrees. In fact, wires
are usually rectilinear, that is, run horizontally and vertically. Thus, we assume that wires are
rectilinear.

To complete the physical modeling of chips we recognize three types of transmission
model, the synchronous, transmission-line, and diffusion models. The synchronous model
assumes that one unit of time is needed to transmit a bit across a wire, independent of its
length. This is a good model when the switching time of gates is large by comparison with 
the time to transmit data through a wire or when wires are short, a situation that prevails for 
most designs. When it does not prevail, the unit of transmission time can be increased so that 
it does apply. The transmission-line model assumes that the time to transmit a bit across a 
wire is proportional to its length (see Problems 12.1 and 12.2), whereas the diffusion model 
assumes it is quadratic in its length. The models apply to VLSI chip technologies at different 
wire lengths. The synchronous, transmission-line, and diffusion models apply to wires that are 
short, medium-length, and long, respectively. 
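
The three transmission models differ only in how delay scales with wire length, which the following sketch makes explicit. The function name `wire_delay` and the proportionality constant `unit` are hypothetical; the text fixes only the growth rates (constant, linear, quadratic), not the constants.

```python
def wire_delay(length: float, model: str, unit: float = 1.0) -> float:
    """Time to send one bit across a wire of the given length.

    `unit` is an illustrative constant of proportionality; only the
    dependence on `length` comes from the three models in the text.
    """
    if model == "synchronous":
        return unit                  # independent of length
    if model == "transmission-line":
        return unit * length         # proportional to length
    if model == "diffusion":
        return unit * length ** 2    # quadratic in length
    raise ValueError(f"unknown transmission model: {model}")


for model in ("synchronous", "transmission-line", "diffusion"):
    delays = [wire_delay(length, model) for length in (1.0, 2.0, 4.0)]
    print(model, delays)
```

Doubling the wire length leaves the synchronous delay unchanged, doubles the transmission-line delay, and quadruples the diffusion delay.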

Although we do not examine energy consumption in this chapter, the type of gate used 
can have a large impact on the amount of energy consumed during a computation. NMOS 
transistors consume energy all the time, whereas CMOS transistors consume energy only when 
they change their state. 

When the areas of I/O pads and gates are comparable, the placement of the pads on a VLSI
chip can have a big impact on the area occupied by a chip. For example, if the chip realizes a
tree and its $n$ leaves (and their pads) are placed on the boundary of a convex region, as noted
in Problem 12.3, the chip must have area proportional to $n \log n$. However, as shown in
Section 12.5.1, when its leaves can be placed anywhere, there is a layout for a tree (known as 
the H-tree) that has area proportional to n. If the I/O pads are much larger than the gates, the 
impact of their placement is diminished. 



12.3 VLSI Computational Models 



We assume that a VLSI chip implements a finite-state machine instantiated as a clocked se- 
quential machine. (A chip could also model an analog computer rather than a digital one, a 
topic not discussed in this book.) Although every FSM is eventually realized from two-input 
gates, binary memory cells, and wires carrying binary values (see Section 3.1), chips are gener- 
ally designed around an aggregate model for data. That is, if operations are done on integers, 
the wires associated with an integer travel together on the chip surface. Although the time re- 
quired for an operation on data depends on the size of alphabet from which the data is drawn 
and on the complexity of the operation itself, we simplify the analysis by assuming that one 
unit of time is taken. A more sophisticated analysis takes these factors into account. 

To be concrete we let the states of an FSM be represented as tuples over a set $X$ of binary
$b$-tuples. We also assume that gates realize functions $\{h : X^2 \mapsto X\}$ and that memory cells
hold one value of $X$. We recognize a logic circuit over the set $X$ as the graph of a straight-line
program in which the operations are drawn from a basis $\{h : X^2 \mapsto X\}$. This model is used to study
problems defined over non-binary alphabets, such as matrix multiplication and the discrete 
Fourier transform over rings. 

We continue to use the notation $\lambda$ for the minimum feature size of a VLSI chip even
though we now allow data to be treated as values in the set X . When the set X is big, it will 
be important to make use of its size in accounting for the area occupied by wires and gates, an 
issue that we ignore in this chapter. 

Computation time in the synchronous model is the number of steps executed by a chip. 
This is the same measure of time used for finite-state machines. Computation time in the
other models is the elapsed time in seconds, which is approximated by the number of steps
multiplied by the length of the longest step. This time is generally a function of the area of the 
chip and the problem for which the chip is designed. 

Another measure of time, but one that is given only a cursory examination, is the period 
P of a VLSI chip. This is the time between successive inputs to a pipelined chip, one designed 
to receive a new set of inputs while the previous inputs are propagating through it. Pipelining 
is illustrated in Section 12.5.1 on H-trees and Section 11.6 on block I/O.

In this chapter we assume that VLSI chips compute a single function $f : X^n \mapsto X^m$,
a perfectly general assumption that allows any FSM computation to be performed. While 
this allows the VLSI chip to be a CPU or a RAM, to convey ideas we limit our attention 
to functions that are simply defined, such as matrix multiplication and the discrete Fourier 
transform. 

The variables of the function computed by a VLSI chip are supplied via its I/O ports. A 
single port can receive the values of multiple variables but at different time instances. Also, 
the value of a variable can be supplied at multiple ports, either in the same time step or in 
multiple time steps. However, the outputs of a function computed by a chip are supplied once 
to an output port. As noted above, a port can be either an input or output port or serve both 
purposes, but not in the same time step. 

As with the FSM, we cannot allow either the time or the I/O port at which data is received 
as input or is supplied as output to be data-dependent. To do otherwise is to assume that an 
external agent not included in the model is performing computations on behalf of the user. 
We can expect misleading results if this is allowed. Thus, we assume that each I/O operation 
is where- and when-oblivious; that is, where an input or output occurs is data-independent, 
as are the times at which the I/O operations occur. 

For many VLSI computations it is important that the input data be read once by the 
chip even if it may be convenient to read it multiple times. (These are called semellective or 
read-once computations.) For example, if a chip is connected to a common bus it may be 
desirable to supply the data on which the chip operates once rather than add hardware to the 
chip to allow it to request external data. However, in other situations it may be desirable to 
provide data to a chip multiple times. Such computations are called multilective. Multilective 
computations must be where- and when-oblivious. 

If a multilective VLSI algorithm reads its $n$ input variables $\beta\mu n$ times but only $\mu n$ times
when multiple inputs of a variable (at multiple time steps) at one I/O port are treated as a
single input, then the algorithm is $(\beta, \mu)$-multilective.
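
The two parameters can be read off an I/O schedule by counting, as the following sketch illustrates. The representation of a schedule as `(time_step, port, variable)` triples and the name `multilectivity` are assumptions made for this example. The function returns averages, which coincide with $\beta$ and $\mu$ when every variable is treated uniformly.

```python
def multilectivity(reads, n):
    """Return (beta, mu) for an I/O schedule over n input variables.

    `reads` is an iterable of (time_step, port, variable) triples
    (a hypothetical encoding of a where- and when-oblivious schedule).
    The total number of reads is beta * mu * n; merging repeated reads
    of one variable at one port leaves mu * n reads.
    """
    reads = list(reads)
    merged = {(port, variable) for _, port, variable in reads}
    mu = len(merged) / n
    beta = len(reads) / len(merged)
    return beta, mu


# Two variables, each supplied three times at each of two ports.
schedule = [
    (2 * repeat + v, port, f"x{v}")
    for port in ("p0", "p1")
    for v in range(2)
    for repeat in range(3)
]
print(multilectivity(schedule, n=2))
```

Here the schedule contains 12 reads, of which 4 remain after merging repeats at a port, so it is $(3, 2)$-multilective.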



12.4 VLSI Performance Criteria 

As stated in Theorem 7.4.1, the product $pT_p$ of the time, $T_p$, and the number of processors, $p$,
in a parallel network of RAM processors to solve a problem cannot be less than the serial time,
$T_s$, on a serial RAM with the same total storage capacity for that problem. Applying this result
to the VLSI model, since the number of processors of any given size that can be placed on a
chip of area $A$ is proportional to $A$, it follows that the product $AT$ of area with the time $T$
for a chip to complete a task cannot be less than the serial time to compute the same function
using a single processor; that is, $AT = \Omega(T_s)$.

In the next section we show that the matrix-vector multiplication and prefix functions can
be realized optimally with respect to the $AT$ measure. This holds because these problems have
low complexity. For problems of higher complexity, such as $n \times n$ matrix-matrix multiplication,
we cannot achieve $AT$-optimality because stronger lower bounds apply. In particular, both
$AT^2$ and $A^2T$ must grow as $n^4$ for this problem, as we show. $AT$, $AT^2$, and $A^2T$ are the only
measures of VLSI performance considered in this chapter.



12.5 Chip Layout 



In this section we describe and discuss layouts for a number of important graphs and problems. 
These include balanced binary trees, multi-dimensional meshes, and the cube-connected cycle. 

12.5.1 The H-Tree Layout 

H-trees are embeddings of binary trees that use area efficiently. Let $H_k$ be an H-tree with $4^k$
leaves. Figure 12.2 shows the H-tree $H_2$ with 16 darkly shaded squares that can be viewed
either as subtrees or leaves. The lightly shaded regions are internal vertices of the binary tree.
Leaves often perform special functions that are not performed by internal vertices whereas
internal vertices of a tree often perform the same function. Each quadrant of the tree shown in
Fig. 12.2 can be viewed as the H-tree $H_1$ on four subtrees or leaves.

The layout of $H_k$ is recursively defined as follows: replace each of the four leaves of $H_1$
with a copy of $H_{k-1}$. Thus, $H_2$ in Fig. 12.2 is obtained by replacing each leaf in $H_1$ with a copy
of $H_1$.

We now derive an upper bound on the area of an H-tree under the assumption that each
vertex is square, leaf vertices occupy area $b^2$, and the separation between leaf vertices is $c$. If
$S(k)$ is the length of a side of $H_k$, then $S(1) = 2b + c$. Also, from the recursive construction
of $H_k$ the following recurrence holds:

$$S(k) = 2S(k-1) + c$$

Figure 12.2 The H-tree $H_2$ containing 16 subtrees (or leaves).

The solution to this recurrence is $S(k) = (b + c)2^k - c$ as the reader can verify. Since
$H_k$ has $n = 4^k$ leaves and area $A_n = (S(k))^2$, it follows that an H-tree with $n$ leaves has area
$A_n \le n(b + c)^2$.
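
The verification left to the reader can also be done mechanically. The sketch below, with illustrative values for the leaf side $b$ and the separation $c$, checks the closed form $S(k) = (b + c)2^k - c$ against the recurrence $S(1) = 2b + c$, $S(k) = 2S(k-1) + c$, and confirms the area bound $(S(k))^2 \le n(b + c)^2$ for $n = 4^k$. The function names are hypothetical.

```python
def side_recursive(k: int, b: int, c: int) -> int:
    """Side length from S(1) = 2b + c and S(k) = 2 S(k-1) + c."""
    if k == 1:
        return 2 * b + c
    return 2 * side_recursive(k - 1, b, c) + c


def side_closed_form(k: int, b: int, c: int) -> int:
    """Closed form S(k) = (b + c) 2^k - c."""
    return (b + c) * 2 ** k - c


b, c = 3, 1  # illustrative leaf side length and leaf separation
for k in range(1, 11):
    n = 4 ** k  # number of leaves of H_k
    side = side_recursive(k, b, c)
    assert side == side_closed_form(k, b, c)
    assert side ** 2 <= n * (b + c) ** 2  # area is linear in n
```

The bound follows because $S(k) < (b + c)2^k$ and $2^{2k} = n$, so the layout area grows linearly in the number of leaves.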

To appreciate the importance of the H-tree construction, observe that its leaves are interior 
to the layout. Given the usual drawing of a binary tree one is tempted to place its leaves along 
the boundary of a chip. If this boundary is convex, the area of a binary tree on n leaves must 
be at least proportional to $n \log n$. (See Problem 12.3.)



MATRIX-VECTOR MULTIPLICATION ON AN H-TREE We now describe an algorithm based on an 
H-tree that multiplies an $n \times n$ matrix $A$ with an $n$-vector $x$, $n = 2^k$, by forming the $n$ inner
products of the n rows of A with x. (Matrix-vector multiplication is defined in Section 6.2.2.) 
This algorithm assumes that one unit of time is taken to store one piece of data and to perform 
an addition or multiplication on data. 

On the first time step of our algorithm the components of the vector x are supplied in 
parallel to the n leaves of the tree and stored there. On the second time step components of 
the first row of $A$ are also provided in parallel to the leaves. In the third time step
corresponding components of $x$ and the first row of $A$ are multiplied. In $k = \log_2 n$
additional time steps these products are added in the H-tree and the result supplied as output.
In the next two steps the second row of A is supplied as input and its components multiplied 
by those of x. After k additional steps these products are summed and the result generated 
as output. This process is repeated for each of the remaining rows of A. This algorithm is 
semellective. 

Since we treat the time to add and multiply as the basis for measuring the time required 
by this H-tree, each inner product requires $O(\log n)$ time and the $n$ inner products require
$O(n \log n)$ time. However, if each addition vertex in this tree can also store its result (thereby
causing a slight increase in area), a new row of A can be supplied to the H-tree in each unit 
of time (we say the period of the computation is P = 1) because a series of partial results 
can move through the tree in parallel. This is an example of pipelining. In this case the time 
to perform the $n$ inner products is $O(n + \log n) = O(n)$. If pipelining is not used, this
matrix-vector multiplication algorithm does not make the best use of area and time, as we now
show. 
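
The effect of pipelining on the step count can be tallied directly. The sketch below is an accounting under stated assumptions, not a simulation of a chip: one step loads $x$, each row needs one step to load, one to multiply, and $\log_2 n$ steps to sum, and in the pipelined case a new row is assumed to enter on every step (period $P = 1$). The function names are hypothetical.

```python
def log2_exact(n: int) -> int:
    """Return log2(n) for n a power of 2, else raise ValueError."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of 2")
    return n.bit_length() - 1


def steps_without_pipelining(n: int) -> int:
    """One step to load x, then load + multiply + log2(n) adds per row."""
    k = log2_exact(n)
    return 1 + n * (2 + k)


def steps_with_pipelining(n: int) -> int:
    """The first row takes 2 + log2(n) steps; each later row finishes one step after."""
    k = log2_exact(n)
    return 1 + (2 + k) + (n - 1)


for n in (16, 256, 4096):
    print(n, steps_without_pipelining(n), steps_with_pipelining(n))
```

The first count grows as $n \log n$ and the second as $n + \log n$, in agreement with the bounds stated above.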

Even without pipelining there exists an $AT$-optimal algorithm for matrix-vector multiplication.
Let $n$ be such that $n/\log_2 n$ is a power of 4. Decompose each row of $A$ as well as $x$
into $(\log_2 n)$-tuples. This is equivalent to representing the $n \times n$ matrix $A$ by an $n \times (n/\log_2 n)$
matrix $B$ whose entries are $1 \times \log_2 n$ matrices (equivalently, $(\log_2 n)$-vectors) and to
representing $x$ by an $(n/\log_2 n)$-vector $y$ whose components are $(\log_2 n)$-vectors.

We implement this computation on an H-tree with $O(n/\log n)$ area. To compute the
inner product of $A$'s $j$th row with $x$, sequentially supply to each H-tree leaf the components
of one $(\log_2 n)$-vector of $y$ and the corresponding vector in the $j$th row of $B$. Supply the
individual components of these $(\log_2 n)$-vectors in alternate cycles. After a leaf vertex receives
the corresponding components of $A$ and $x$, it multiplies them and adds the result to its running
sum. Upon completion of an inner product of two $(\log_2 n)$-vectors, the leaf vertices make their
values available to be added in the H-tree in $O(\log n)$ steps. After $n$ of these operations, all $n$
inner products of $Ax$ are computed.

This algorithm uses $T = O(n \log n)$ time but only has area $A = O(n/\log n)$. Thus,
its area-time product satisfies $AT = O(n^2)$, which is optimal since each of the $n^2 + n$
components of $A$ and $x$ must be read. This algorithm is multilective because it supplies each
component of $x$ $n$ times.
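
A sequential sketch of this blocked algorithm may help fix the bookkeeping. It is a software model only: the leaves are simulated one after another although they operate in parallel on the chip, two cycles are charged per component because the entries of $x$ and of the row of $A$ arrive in alternate cycles, and the multiply-accumulate is assumed to overlap with those cycles. The function name `blocked_matvec` and the test data are hypothetical.

```python
def blocked_matvec(A, x):
    """Multiply A by x with n/log2(n) leaves, each handling log2(n) entries.

    Returns (product, steps). Assumes n is a power of 2 and n/log2(n) is
    a power of 2, so that the leaf sums can be added pairwise in a tree.
    """
    n = len(x)
    m = n.bit_length() - 1            # block length log2(n)
    leaves = n // m                   # number of H-tree leaves
    assert 1 << m == n and leaves * m == n
    assert leaves & (leaves - 1) == 0
    steps = 0
    product = []
    for row in A:
        partial = [0] * leaves
        for t in range(m):            # the t-th component of every block
            for leaf in range(leaves):    # parallel on the chip
                idx = leaf * m + t
                partial[leaf] += row[idx] * x[idx]
            steps += 2                # x entry and A entry in alternate cycles
        while len(partial) > 1:       # one tree level per step
            partial = [partial[i] + partial[i + 1]
                       for i in range(0, len(partial), 2)]
            steps += 1
        product.append(partial[0])
    return product, steps


n = 16
A = [[(i * j + 1) % 7 for j in range(n)] for i in range(n)]
x = [j + 1 for j in range(n)]
y, steps = blocked_matvec(A, x)
assert y == [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
print(steps)
```

For $n = 16$ there are 4 leaves, each row costs $2 \cdot 4 + 2 = 10$ steps under this accounting, and the total is proportional to $n \log n$ while the number of leaves is $n/\log_2 n$.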



PREFIX COMPUTATION ON AN H-TREE The H-tree is also an effective way to do a prefix
computation. Prefix computations (let $\odot$ be the associative operator) are naturally executed on
trees. A tree-based prefix computation is described in Problem 7.31. One datum enters the
root of the tree; the rest travel up from the leaves. When implemented on an H-tree, this
algorithm uses area $O(n)$ on $n$ inputs and time $O(\log n)$, giving an $AT$ product of $O(n \log n)$.
This algorithm is semellective.

This algorithm can be converted into an $AT$-optimal algorithm using a technique similar
to that used above. We subdivide the input $n$-tuple $x$ into $(\log_2 n)$-tuples, of which there are
$n/\log_2 n$, and serially form the associative combination of the $\log_2 n$ components of each
tuple in $\log_2 n$ steps. We then perform the prefix computation on these $n/\log_2 n$
results. To complete the computation, for $1 \le j \le (n/\log_2 n) - 1$ we reread each of the
original $(\log_2 n)$-tuples in parallel and add the $(j - 1)$st result (the zeroth result is 0) to the
first component of the $j$th $(\log_2 n)$-tuple, and then serially perform a prefix computation on
these new $(\log_2 n)$-tuples.

We increase $n/\log_2 n$ to the next power of 4 (adding inputs whose corresponding
outputs are ignored) and embed the tree of Fig. 7.23 directly into an H-tree. The initial associative
combination of $(\log_2 n)$-tuples and the final prefix computation on $(\log_2 n)$-tuples are done
at vertices of the H-tree that are I/O vertices of the prefix tree. This algorithm takes time
$O(\log n)$ on the initial and final phases as well as on the prefix computation. Since the area of
the layout is $O(n/\log_2 n)$ and every one of the $n$ inputs must be read, its area-time product,
$AT$, is $O(n)$, which is optimal. This algorithm is multilective since each input is supplied
twice.
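
The three phases of the blocked prefix computation can be sketched in software as follows. This is a sequential model under stated assumptions: `itertools.accumulate` stands in for the tree-based prefix computation of the middle phase, blocks are 0-indexed, and the block length is passed as a parameter rather than fixed at $\log_2 n$. The function name `blocked_prefix` is hypothetical.

```python
from functools import reduce
from itertools import accumulate
import operator


def blocked_prefix(xs, block, op=operator.add):
    """Prefix computation on xs under an associative op, done block by block."""
    blocks = [xs[i:i + block] for i in range(0, len(xs), block)]
    # Phase 1: serially combine the components of each block.
    totals = [reduce(op, blk) for blk in blocks]
    # Phase 2: prefix computation on the block totals (done on the tree).
    prefix_totals = list(accumulate(totals, op))
    # Phase 3: reread each block, fold the preceding result into its
    # first component, and serially compute the prefix within the block.
    out = []
    for j, blk in enumerate(blocks):
        if j == 0:
            out.extend(accumulate(blk, op))
        else:
            first = op(prefix_totals[j - 1], blk[0])
            out.extend(accumulate([first] + blk[1:], op))
    return out


values = list(range(1, 17))
assert blocked_prefix(values, 4) == list(accumulate(values))
```

Because the preceding result is always applied on the left, the sketch also works for associative operators that are not commutative, such as string concatenation. Each input is read once in phase 1 and once in phase 3, which is why the algorithm is multilective.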

12.5.2 Multi-dimensional Mesh Layouts 

As explained in Section 7.5, many important problems can be solved with systolic arrays. If 
the cells of one- and two-dimensional systolic arrays are of fixed size and quasiplanar, they can 
be embedded directly onto a chip with area proportional to the number of cells. Applying the 
results of Theorems 7.5.1, 7.5.2, and 7.5.3 we have the following facts concerning the area and
time for three important problems when realized by such arrays. 



Problem                                                   Dimensions   Area     Time
$n \times n$ Matrix-Vector Multiplication                 1D           $O(n)$   $O(n)$
Bubble Sort of $n$ items                                  1D           $O(n)$   $O(n)$
Batcher's Odd-Even Sorting of $n$ items                   1D           $O(n)$   $O(n)$
$\sqrt{n} \times \sqrt{n}$ Matrix-Matrix Multiplication   2D           $O(n)$   $O(\sqrt{n})$

Fully normal algorithms for problems such as shifting, summing, broadcasting, and fast
Fourier transform on $n$ inputs, where $n$ is a power of 2, can each be done in $O(\log n)$ steps
on the $n$-vertex hypercube or the canonical cube-connected cycles network on $n$ vertices.
From Theorems 7.7.4 and 7.7.5 these problems can also be solved in $O(n)$ and $O(\sqrt{n})$ steps,
respectively, on $n$-vertex one- and two-dimensional systolic arrays. We summarize these facts
in Figure 12.3.

Problem                          Dimensions   Area     Time
Shifting of $n$-vector           1D           $O(n)$   $O(n)$
                                 2D           $O(n)$   $O(\sqrt{n})$
Summing $n$ items                1D           $O(n)$   $O(n)$
                                 2D           $O(n)$   $O(\sqrt{n})$
Broadcasting to $n$ locations    1D           $O(n)$   $O(n)$
                                 2D           $O(n)$   $O(\sqrt{n})$
$n$-point FFT                    1D           $O(n)$   $O(n)$
                                 2D           $O(n)$   $O(\sqrt{n})$

Figure 12.3 Area vs. time performance of VLSI algorithms for four problems.



In Section 12.6 we show that shifting of an $n$-vector, the $n$-point FFT, and $\sqrt{n} \times \sqrt{n}$
matrix-matrix multiplication each require area $A$ and time $T$ satisfying $AT^2 = \Omega(n^2)$. Consequently,
the 2D algorithms cited above for these problems are optimal to within a constant factor.
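
As a quick check of this claim, the area and time bounds for the two-dimensional arrays can be substituted into the $AT^2$ measure. This is a worked restatement of the bounds quoted above, not an additional result; the one-dimensional case is included only for contrast.

```latex
\begin{align*}
\text{2D array:}\quad A = O(n),\; T = O(\sqrt{n})
  &\;\Longrightarrow\; AT^2 = O(n)\cdot O(n) = O(n^2),\\
\text{1D array:}\quad A = O(n),\; T = O(n)
  &\;\Longrightarrow\; AT^2 = O(n)\cdot O(n^2) = O(n^3).
\end{align*}
```

The 2D upper bound matches the lower bound $AT^2 = \Omega(n^2)$ up to a constant factor, whereas the 1D arrays exceed it by a factor of $n$.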

In the next section we show that every normal algorithm can be implemented on
the cube-connected cycles (CCC) network in time $T$ satisfying $\Omega(\log n) \le T \le O(\sqrt{n})$
and that the CCC network can be embedded in the plane using area $A = O(n^2/T^2)$. In
Theorems 12.7.2 and 12.7.3 we show that these implementations are optimal up to constant 
multiplicative factors with respect to area and time for the three problems mentioned above. 



12.5.3 Layout of the CCC Network 

In Section 7.7.6 we describe the realization of a fully normal algorithm on the canonical CCC 
network. The realization extends directly from the canonical CCC network to a general $(k,d)$-CCC
network in which there are $2^d$ cycles and $2^k$ vertices on each cycle. (See Fig. 12.4.)

A fully normal algorithm is simulated on the CCC network by giving the processors on 
the $j$th cycle, $0 \le j \le 2^d - 1$, the addresses $i + j2^k$ where $0 \le i \le 2^k - 1$. The cycles
are treated as 1D arrays and used to simulate a normal algorithm on the first $k$ dimensions
exactly as is done in Section 7.7.6. These simulations are done in parallel after which the 
swaps across the higher-order d dimensions are simulated by first rotating the leading element 
on each cycle to the first of the inter-cycle edges. After executing one swap, each cycle is 
advanced one step so that the second elements on each cycle are aligned with the first of the 
high-order dimensions. At this point the first elements on each cycle are aligned with the edge 
associated with the second of the high-order dimensions. Thus, while swaps are done between 
the second elements on each cycle across the first of the high-order dimensions, swaps occur 
between leading elements along the second of the high-order dimensions. This rotating and 
swapping is done until all cycle elements have been swapped across all high-order dimensions. 
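
The addressing scheme can be illustrated with a few lines of code. The sketch assumes the usual hypercube convention that a swap across dimension $t$ pairs an address with the address differing from it in bit $t$; under the mapping $i + j2^k$, the low-order $k$ bits select the position $i$ on a cycle and the high-order $d$ bits select the cycle $j$. The function names are hypothetical.

```python
def ccc_location(address: int, k: int):
    """Split a hypercube address i + j * 2^k into (cycle j, position i)."""
    j, i = divmod(address, 1 << k)
    return j, i


def swap_partner(address: int, dim: int, k: int):
    """Return the location of the swap partner across `dim`, and whether
    the swap stays on one cycle (dim < k) or crosses cycles."""
    partner = address ^ (1 << dim)   # flip bit `dim` of the address
    return ccc_location(partner, k), dim < k


k = 3  # cycles of length 2^3, as in a (3, 4)-CCC network
print(ccc_location(43, k))      # 43 = 3 + 5 * 2^3
print(swap_partner(43, 1, k))   # low-order dimension: same cycle
print(swap_partner(43, 4, k))   # high-order dimension: another cycle
```

Swaps across the first $k$ dimensions change only the position on a cycle, which is why they can be simulated on each cycle independently, while swaps across the remaining $d$ dimensions keep the position and change the cycle.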

This algorithm performs $O(2^k)$ steps on the cycles to perform swaps across low-order
dimensions and align the cycles for swaps at higher dimensions. An additional $O(d)$ steps are
used to perform swaps on the $d$ high-order dimensions. Thus, the number of steps used by
this algorithm, $T$, satisfies $T = O(2^k + d)$. The number of processors used in a $(k, d)$-CCC
network, $n$, satisfies $n = 2^{d+k}$.

Figure 12.4 An embedding of a $(k, d)$-CCC network in the plane for $k = 3$ and $d = 4$. The
$2^d$ columns represent cycles of length $2^k \ge d$. For $1 \le j \le d$, the $j$th vertex on each cycle is
connected to the $j$th vertex on another cycle.



Figure 12.4 shows a layout of a $(3, 4)$-CCC network. A layout for a general $(k, d)$-CCC
network, $2^k \ge d$, can be developed following this pattern. Place each cycle of length $2^k$ in
a column. Use $2^d - 1$ rows to make connections between columns. These rows are divided
into $d$ sets. The first set, consisting of one row, connects adjacent columns. The second
set, containing two rows, connects every other column. The $j$th set, containing $2^{j-1}$ rows,
connects every $2^{j-1}$th column. The number of rows used for these connections is
$1 + 2 + 4 + \cdots + 2^{d-1} = 2^d - 1$. Since $d$ processors are used in each column to make these connections,
each column contains $2^k - d \ge 0$ processors not connected to other columns. (These are
suggested by the lightly shaded vertices.) It follows that this layout has $2^d + 2^k - (d + 1)$ rows
and $2^{d+1}$ columns. If a wire is assumed to have the same width as a processor, the layout has
area $A = 2^{d+1}(2^d + 2^k - (d + 1))$.

Recall that $n = 2^{d+k}$ and $2^k \ge d$ or $k \ge \log_2 d$. It follows that $T = \Theta(2^k + d) = \Theta(2^k)$.
Since $k \ge \log_2 d$, $T = \Omega(d) = \Omega(\log n)$. Also, when $k \le d$, $2^{2k} \le n$ and $T \le O(\sqrt{n})$. We
summarize this result below.



THEOREM 12.5.1 Every fully normal algorithm for an $n$-processor hypercube can be implemented
on a CCC network whose VLSI layout has area $A$ and uses time $T$ satisfying the following bound
for $\Omega(\log n) \le T = O(\sqrt{n})$:

$$AT^2 = O(n^2)$$

This result can be applied to any of the fully normal algorithms described in Section 7.6
and the Benes permutation network discussed in Section 7.8.2.
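The bound of Theorem 12.5.1 can be checked from the layout area and running time given above. The following derivation is a sketch under one stated assumption: $c$ is a hypothetical constant such that $T \le c\,2^k$, which exists because $T = \Theta(2^k)$.

```latex
\begin{align*}
A   &= 2^{d+1}\bigl(2^d + 2^k - (d+1)\bigr) \le 2^{d+1}\cdot 2\cdot 2^{d} = 4\cdot 2^{2d}
      && \text{(since } k \le d\text{)},\\
T   &\le c\,2^{k}
      && \text{(since } T = \Theta(2^k)\text{)},\\
AT^2 &\le 4c^2\, 2^{2d}\, 2^{2k} = 4c^2\, n^2
      && \text{(since } n = 2^{d+k}\text{)}.
\end{align*}
```

The restriction $k \le d$ is what confines the result to the range $T = O(\sqrt{n})$.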

12.6 Area-Time Tradeoffs 

The $AT^2$ measure encountered in the last section is fundamental to VLSI computation. This
is established by deriving a lower bound on $AT^2$ in terms of the planar circuit complexity,
$C_{p,\Omega}(f)$, of the function $f$ computed by a VLSI chip of area $A$ in $T$ steps. A similar result is
derived for the product $A^2T$. The planar circuit size of $f$ is the size of the smallest memoryless
planar circuit for $f$. The measures $AT^2$ and $A^2T$ are the sizes of two different memoryless
planar circuits that compute the same mapping from inputs to outputs as a VLSI chip of area
$A$ that executes $T$ steps.

12.6.1 Planar Circuit Size 

We now formally define planar circuit size and show how it relates to the standard circuit size 
measure. 

DEFINITION 12.6.1 A planar circuit over the set X is a logic circuit over the set X that has been 
embedded in the plane in such a way that gates do not overlap but edges may cross. A planar circuit 
is semellective if there is a unique vertex at which each input variable is supplied. Otherwise, the 
planar circuit is multilective. 

The size of a planar circuit is the number of inputs, edge crossings, and gates drawn from
a basis $\Omega = \{h : X^2 \mapsto X\}$ that the circuit contains. The planar circuit size of a function
$f : X^n \mapsto X^m$ over $\Omega$, $C_{p,\Omega}(f)$, is the size of the smallest planar circuit for $f$ over the basis $\Omega$.

A multilective circuit of order $\mu$, $\mu \ge 1$, for a function $f : B^n \mapsto B^m$ has $\mu n$ input vertices.
The size of the smallest multilective planar circuit of order $\mu$ for $f$ is denoted $C^{(\mu)}_{p,\Omega}(f)$. If the
planar circuit is semellective, the planar circuit size of $f$ is denoted $C^{(1)}_{p,\Omega}(f)$ or $C_{p,\Omega}(f)$ when
confusion is not likely.

Every binary function has a planar circuit. To see this, observe that every function has a 
circuit, which is a graph, and that every graph has a planar embedding with edge crossings. 
The planar circuit size of a function is at worst quadratic in its standard circuit size, as we now 
show. 

LEMMA 12.6.1 The (multilective) planar circuit size and standard circuit size of $f : B^n \mapsto B^m$
relative to the basis $\Omega$ are in the following relationship, where $r$ is the fan-in of $\Omega$:

$$C_\Omega(f) + n \le C_{p,\Omega}(f) \le r^2 C_\Omega^2(f)/2 + C_\Omega(f) + n$$

Proof The first inequality follows because the planar circuit size measure includes inputs, 
crossings, and gates, whereas the circuit size measure includes only gates. 

Consider an embedding of a standard circuit for $f$ containing $C_\Omega(f)$ gates. In such
an embedding it is not necessary for any two edges to intersect more than once because if
they violate this condition the edge segments between any two successive crossings can be
swapped so that these two crossings can be eliminated. Since every gate has at most $r$ inputs,
a minimal standard circuit for $f$ has at most $rC_\Omega(f)$ edges connecting gates. It follows that
the number of crossings does not exceed $r^2 C_\Omega(f)^2/2$ because there are at most $\binom{q}{2}$ ways of
forming pairs drawn from a set of size $q$ and $q = rC_\Omega(f)$. Combining this with the number
of inputs and gates, we have the desired upper bound. ■

Figure 12.5 Two simulations of a T-step VLSI chip computation by a planar circuit.
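The two sides of Lemma 12.6.1 are easy to evaluate for a given gate count. The sketch below is illustrative only; the function name `planar_size_bounds` is hypothetical, and it reports both the exact pair count $\binom{q}{2}$ used in the proof and the slightly weaker closed form stated in the lemma.

```python
from math import comb


def planar_size_bounds(gates: int, n_inputs: int, fan_in: int = 2):
    """Return (lower, upper_from_pairs, upper_from_lemma) for Lemma 12.6.1."""
    lower = gates + n_inputs
    q = fan_in * gates            # at most r * C edges connect gates
    crossings = comb(q, 2)        # each pair of edges crosses at most once
    upper_from_pairs = crossings + gates + n_inputs
    upper_from_lemma = (fan_in * gates) ** 2 / 2 + gates + n_inputs
    return lower, upper_from_pairs, upper_from_lemma


lower, exact, stated = planar_size_bounds(gates=10, n_inputs=4)
assert lower <= exact <= stated
```

The gap between the last two values is $q/2$, which is why the lemma can state the simpler quadratic expression.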

In Section 12.7 we show that $f^{(n)}_{\mathrm{cyclic}}$ nearly meets the upper bound of Lemma 12.6.1. That
is, the planar circuit size of this function is nearly quadratic in its standard circuit size.

12.6.2 Computational Inequalities 

We now show that every VLSI chip computation can be simulated by planar circuits of size
$O(AT^2)$ and $O(A^2T)$. The simulation is patterned on the simulations of Chapter 3; that is,
the loop that constitutes the computation by the chip with memory is unwound to create a 
planar circuit. Instead of passing the outputs of the next-state/output circuit to binary memory 
cells they are passed to another copy of the circuit. 

Figure 12.5 shows two simulations of a T-step VLSI chip computation by a planar circuit. 
The first is obtained by placing T copies of the chip one above the other and supplying the 
state output of one copy to the state input of the next copy. The second is simulated by placing 
T copies of the chip side by side and running wires from the state output of one chip to the 
state input of the next. We convert each of these memoryless circuits to planar circuits and 
bound the number of inputs, crossings and gates they contain. Recall that we assume that 
wires are rectilinear; that is, they run only horizontally and vertically. 

Since the number of wire layers on a single chip is bounded, it does not hurt to assume 
that the centerlines of parallel wires on different planes are displaced slightly. (It is bad practice 
to overlap wires because one wire can induce currents in the other.) Now make the width of 
wires and the area of gates infinitesimal. (Wires are shrunk to their centerline.) As shown in 
Fig. 12.6(a), each two-input gate is replaced by an infinitesimal vertex connected by a straight- 
line to its output and the two connections from its inputs are made by wires that contain bends 
(two wires touch). This converts a single chip to a planar graph with wires that touch or cross. 
(See Fig. 12.6(b) and (c)). 

We now bound $n_w$, the number of wires, and $n_g$, the number of gates on a chip of area
$A$. Since each wire has width $\lambda$ and length at least $\lambda$ and each gate occupies area $\lambda^2$, $n_w$ and
$n_g$ satisfy the following bounds.

$$n_w \le A/\lambda^2$$
$$n_g \le A/\lambda^2$$

Figure 12.6 (a) The result of shrinking a physical gate to a point, (b) a crossing of two wires,
and (c) four types of connection between two wires.


Because each point of crossing or touching of wires occupies area at least $\lambda^2$, the number
of points at which wires cross and touch on each of the $\nu$ layers of a chip that has area $A$ is
at most $A/\lambda^2$. As shown in Fig. 12.6(a), when gates are made infinitesimal two additional
bends are created at the point at which the output wire touches the gate. This can be viewed
as adding four wire bends per gate. Since the number of gates is at most $A/\lambda^2$, we have the
following bound on $n_{cr}$, the number of wire crossings and touchings.

$$n_{cr} \le (\nu + 4)A/\lambda^2$$

Consider the first of the two simulations. T layers of one chip are placed one above the 
other. To expose overlapping wires, displace all layers to the northeast by an infinitesimal 
amount. Every pair of wires that cross or meet has the potential to introduce crossings, as 
suggested in Fig. 12.7(a) and (b). The maximum number of crossings that can be introduced
per touching or crossing of wires is $T^2$. Since the number of input vertices is $O(AT)$, this
provides an upper bound of $O(AT^2)$ on the number of inputs, gates, and crossings of the
resultant planar circuit.

Now consider the second simulation. $T$ copies of one chip are laid side-by-side and the
layout of each chip opened and at most $n_w$ parallel wires inserted to make connections to
adjacent chips. Since there are $n_w$ wire segments on a single chip, at most $n_w^2$ new wire
crossings are introduced on one chip. Thus, the number of inputs, gates, and crossings in this
layout is $O(AT + n_w^2 T) = O(A^2T)$.
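To see how the two simulations trade area against time, the counts above can be combined numerically. This is a rough sketch rather than the book's accounting: the function names are hypothetical, the per-step input count is assumed to be at most $A/\lambda^2$ (the text only states that the inputs number $O(AT)$), and all constant factors are illustrative.

```python
def chip_counts(area: float, lam: float, layers: int) -> dict:
    """Upper bounds on wires, gates, and crossings/touchings on one chip."""
    unit = area / lam ** 2
    return {"wires": unit, "gates": unit, "crossings": (layers + 4) * unit}


def simulation_sizes(area: float, steps: int, lam: float = 1.0, layers: int = 2):
    """Planar circuit sizes of the stacked and the side-by-side simulation."""
    c = chip_counts(area, lam, layers)
    # Assumption: at most area / lam**2 inputs are read per step.
    inputs_and_gates = 2 * c["gates"] * steps
    # Stacked copies: every crossing or touching can spawn up to T^2 crossings.
    stacked = inputs_and_gates + c["crossings"] * steps ** 2
    # Side-by-side copies: original crossings plus n_w^2 new ones per chip.
    side_by_side = inputs_and_gates + c["crossings"] * steps + c["wires"] ** 2 * steps
    return stacked, side_by_side


stacked, side_by_side = simulation_sizes(area=100, steps=10)
```

With these assumptions the first value grows as $AT^2$ and the second as $A^2T$, so the stacked simulation gives the smaller circuit when $T < A$ and the side-by-side one when $T > A$.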

The following theorem, which is an application of Theorem 3.1.1 to the VLSI model,
summarizes the above results. It makes use of the fact that the planar circuit size of a function
$f$ computed by a VLSI chip of the kind described above is no larger than that of the planar
circuits just constructed. This theorem demonstrates the importance of the measures $AT^2$ and
$A^2T$ as characterizations of the complexity of VLSI computations. It also shows that lower
bounds on the performance of VLSI chips can be obtained in terms of the planar circuit size
of the functions computed by them.

Figure 12.7 Crossings obtained by translating infinitesimally to the northeast T copies of (a)
one crossing and (b) the four possible connections between two wires.

THEOREM 12.6.1 Let $f_M^{(T)}$ be the function computed by a VLSI chip that realizes the FSM $M$
in $T$ steps. The planar circuit size over a basis $\Omega = \{h : X^2 \mapsto X\}$ of any function $f$ computed
by $M$ in $T$ steps satisfies the following inequalities:

$$C_{p,\Omega}(f) = O(AT^2)$$
$$C_{p,\Omega}(f) = O(A^2T)$$

If $M$ is multilective of order $\mu$, then $C_{p,\Omega}(f)$ is replaced by $C^{(\mu)}_{p,\Omega}(f)$.

It is important to note that these relationships between planar circuit size and the measures
$AT^2$ and $A^2T$ hold for all functions computed by VLSI algorithms, both multi-output
functions and predicates.

In the next section we develop the planar separator theorem, which is used in Section 12.7
to derive lower bounds on the planar circuit size of important problems.

12.6.3 The Planar Separator Theorem 

The planar separator theorem applies to graphs $G = (V, E)$ for which a non-negative cost
function $c$ is defined on $V$. The cost of $V$, denoted $c(V)$, is the sum of the costs of the
vertices in $V$. The theorem states that the vertices of every planar graph $G$ on $N$ vertices can be
partitioned into three sets, $A$, $B$, and $C$, such that no edge connects a vertex in $A$ with one in
$B$, the cost of vertices in $A$, $c(A)$, and those in $B$, $c(B)$, satisfy $c(A), c(B) \le 2c(V)/3$, and
$C$ contains at most $4\sqrt{N}$ vertices.

The following lemma uses the concept of the spanning tree of a graph, a tree that contains 
every vertex of a connected graph G. It shows the existence of a cycle that divides a planar graph 
into an "inside" and an "outside" containing about the same number of vertices. The radius 
of a rooted spanning tree is the number of edges on the longest path from the root to a vertex. 
(See Problem 12.8 for an illustration of the following lemma.)

LEMMA 12.6.2 Let $G = (V, E)$ be a finite connected planar graph. Let $c$ be a non-negative
cost function defined on $V$ and let $c(V)$ be the total cost of all vertices in $V$. If $G$ has a rooted
spanning tree of radius $r$, then $V$ can be partitioned into sets $A$, $B$, and $C$ such that $c(A), c(B) \le
2c(V)/3$, no edge joins a vertex of $A$ with one of $B$, and $C$ contains at most $2r + 1$ vertices.

Proof Since the lemma is true if the cost of any vertex exceeds c(V)/3, assume the converse. Let
G = (V, E) be embedded in the plane. A face of a planar graph is a region bounded by 
vertices and edges that does not contain any other vertices and edges. The external face of a 
finite planar graph is the face of unbounded area. Since G is finite, it has an external face. A 
triangular planar graph is a planar graph in which each face is a triangle. If a planar graph 
is not triangular, it can be made triangular by choosing one vertex on the boundary of each 
face and adding an edge between it and every other vertex on this face to which it does not 
already have an edge. Without loss of generality we assume that G is triangular. 

Let $T$ be the spanning tree of radius $r$ postulated in the lemma. Each edge $e$ in $E$ not
on $T$ defines a unique cycle $\xi(e)$ of length at most $2r + 1$. The cycle divides $V$ into three
sets: the vertices on $\xi(e)$ and the vertices on each side of $\xi(e)$. Let $c_1(e)$ and $c_2(e)$ be the
costs of the vertices on either side. (The side with the larger cost is called the inside of the
cycle.) We claim that for some $e$ not on $T$ the larger of $c_1(e)$ and $c_2(e)$ is at most $2c(V)/3$.
We suppose that for every such $e$ the larger is more than $2c(V)/3$ and establish a contradiction.

Let $e = (x, y)$ be an edge not on $T$ such that $\mu(e) = \max(c_1(e), c_2(e))$ is smallest and
for all other $e^*$ such that $\mu(e^*) = \mu(e)$ the inside of $\xi(e)$ has the fewest faces. In case of
ties, let $e$ be chosen arbitrarily. We show the assumption that $\mu(e) > 2c(V)/3$ is false.

Consider the triangle containing the edge $e = (x, y)$ on the side of the cycle $\xi(e)$ that
has largest cost. Let $z$ be the third vertex in this triangle; $z$ is on the spanning tree because
every vertex is on the tree. We consider two cases for $z$: (a) either edge $(x, z)$ or $(y, z)$ is in
$T$ and (b) neither edge is in $T$.

In case (a) without loss of generality, let $(y, z)$ be in $T$. There are two subcases to
consider: (a1) $z$ is on $\xi(e)$ (see Fig. 12.8(a)) and (a2) it is not on $\xi(e)$ (see Fig. 12.8(b)). In
(a1) the edge $e' = (x, z)$ cannot be a tree edge since $T$ contains no cycles unless the cycle
consists of just the vertices $x$, $y$, and $z$, which is impossible since the inside of $\xi(e)$ contains
at least one vertex. But $\xi(e')$ includes the same set of vertices of $V$ inside it (and has the
same cost) as does $\xi(e)$, although it has fewer faces, contradicting the choice for $e = (x, y)$.

Figure 12.8 A non-tree edge $e = (x, y)$ in a triangular planar graph with spanning tree $T$
defines a cycle $\xi(e)$. The triangle containing $e$ on the larger side of $\xi(e)$ contains a third vertex
$z$. In (a) and (b) $(y, z)$ is on $T$, whereas in (c) neither $(x, z)$ nor $(y, z)$ is on $T$. In (a) $(y, z)$ is
on $\xi(e)$, whereas in (b) it is not.

In case (a2) the edge $e' = (x, z)$ is a non-tree edge since $T$ contains no cycles. The inside
of $\xi(e')$ contains no more cost and one less face than $\xi(e)$. If the cost inside $\xi(e')$ is greater
than the cost outside, $e'$ would have been chosen instead of $e$. On the other hand, if the
cost inside $\xi(e')$ is at most the cost outside, since the latter is equal to the cost outside $\xi(e)$,
which is at most $c(V)/3$, the cost inside $\xi(e')$ is at most $c(V)/3$. However, this contradicts
the assumption that $\mu(e^*) > 2c(V)/3$ for all edges $e^*$.

Consider the case (b) in which neither edge $(x, z)$ nor $(y, z)$ is in $T$. (See Fig. 12.8(c).)
The edges $(x, z)$ and $(y, z)$ each define a cycle contained within $\xi(e)$. Without loss of
generality assume that the cycle defined by $(x, z)$ has more cost on the inside of $\xi(e)$ than does
the cycle defined by $(y, z)$. Because the cost of vertices on the inside of the original cycle is
more than $2c(V)/3$, the cost inside and on $\xi((x, z))$ is more than $c(V)/3$. Thus, the cost
outside $\xi((x, z))$ is less than or equal to $2c(V)/3$. If the cost inside $\xi((x, z))$ is also less
than or equal to $2c(V)/3$, we have a contradiction. If greater than $2c(V)/3$, $\xi((x, z))$ is a
cycle with fewer faces for which $\mu((x, z)) > 2c(V)/3$, another contradiction. ■

The following theorem uses Lemma 12.6.2 together with a spanning tree constructed 
through a breadth-first traversal of a connected planar graph to show the existence of a small 
separator that divides the vertices into approximately two equal cost parts. 

THEOREM 12.6.2 Let $G = (V, E)$ be an $N$-vertex planar graph having non-negative vertex
costs summing to $c(V)$. Then, $V$ can be partitioned into three sets, $A$, $B$, and $C$, such that no edge
joins vertices in $A$ with those in $B$, neither $A$ nor $B$ has cost exceeding $2c(V)/3$, and $C$ contains
no more than $4\sqrt{N}$ vertices.



Proof We assume that G is connected. If not, embed it in the plane and add edges as 
appropriate to make it connected. Assume that it has been triangulated, that is, every face 
except for the outermost is a triangle. 

Pick any vertex (call it the root) and perform a breadth-first traversal of $G$. This traversal
defines a BFS spanning tree $T$ of $G$. A vertex $v$ has level $d$ in this tree if the path from the
root to $v$ has $d$ edges. There are no vertices at level $q$, where $q$ is the level one
larger than the largest level of any vertex. Let $R_d$ be the vertices at level $d$ and let $r_d = |R_d|$.

The reader is asked to show that there is some level $m$ such that the cost of vertices
at levels below and above $m$ each is at most $c(V)/2$. (See Problem 12.9.) Let $l$ and $h$,
$l \le m \le h$, be levels closest to $m$ that contain at most $\sqrt{N}$ vertices. That is, $r_l \le \sqrt{N}$ and
$r_h \le \sqrt{N}$. There are such levels because level 0 contains a single vertex and there are none
at level $q$.
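The choice of the levels $l$, $m$, and $h$ can be sketched in code. The following is an illustration, not the book's procedure: the function names `bfs_levels` and `choose_levels` are hypothetical, the graph is assumed to be connected and given as an adjacency dictionary, and $m$ is taken to be the first level at which the cumulative cost reaches half of the total.

```python
from collections import deque


def bfs_levels(adj, root):
    """Return [R_0, R_1, ...]: vertices of a connected graph grouped by BFS level."""
    level = {root: 0}
    queue = deque([root])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in level:
                level[w] = level[v] + 1
                queue.append(w)
    levels = [[] for _ in range(max(level.values()) + 1)]
    for v, d in level.items():
        levels[d].append(v)
    return levels


def choose_levels(levels, cost):
    """Pick l <= m <= h as in the proof; level q = len(levels) counts as empty."""
    n = sum(len(r) for r in levels)
    total = sum(cost[v] for r in levels for v in r)
    running = 0
    for m, r in enumerate(levels):
        running += sum(cost[v] for v in r)
        if 2 * running >= total:   # cost below m and cost above m are each <= total / 2
            break
    sizes = [len(r) for r in levels] + [0]
    small = [d for d, s in enumerate(sizes) if s * s <= n]   # r_d <= sqrt(N)
    l = max(d for d in small if d <= m)
    h = min(d for d in small if d >= m)
    return l, m, h


# A star with nine leaves: level 1 is too large, so l = 0 and h = q = 2.
star = {0: list(range(1, 10)), **{i: [0] for i in range(1, 10)}}
print(choose_levels(bfs_levels(star, 0), {v: 1 for v in star}))  # (0, 1, 2)
```

In the star example the middle set $M$ is the whole of level 1, which is the situation the remainder of the proof handles with Lemma 12.6.2.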

The vertices in $G$ are partitioned into the following five sets: a) $L = \bigcup_{d<l} R_d$, b) $R_l$,
c) $M = \bigcup_{l<d<h} R_d$, d) $R_h$, and e) $H = \bigcup_{h<d} R_d$. Since $L$ and $H$ are subsets of the
sets of vertices with levels less than and more than $m$, $c(L), c(H) \le c(V)/2$. Also, by
construction, $r_l, r_h \le \sqrt{N}$. If $R_l = R_h = R_m$ (which implies that $M$ is empty and
$l = h = m$), let $A = L$, $B = H$, and $C = R_l = R_h$. Then, $C$ is a separator of size at
most $\sqrt{N}$ and the theorem holds. If $l \ne h$, then $h - l - 1 \ge 0$. Since each of the $h - l - 1$
levels between $l$ and $h$ has at least $\sqrt{N} + 1$ vertices, it follows that $h - l - 1 \le \sqrt{N} - 1$
because these levels cannot have more than $N - 1$ vertices altogether.

Consider the subgraph of $G$ consisting of the vertices in $M$ and the edges between them.
Add a new vertex $v_0$ to replace the vertices in $L \cup R_l$ and add an edge from $v_0$ to each of
the vertices at level $l + 1$. This operation retains planarity and the resulting graph remains
triangulated because adjacent vertices on $R_{l+1}$ have an edge between them. Also, it defines a
spanning tree $T^*$ consisting of $v_0$, the new edges, and the projection of the original spanning
tree to the vertices in $M$. $T^*$ has radius at most $\sqrt{N}$.

Apply Lemma 12.6.2 to $T^*$ giving $v_0$ zero cost. This lemma identifies three sets of
vertices, $A_0$, $B_0$, and $C_0$, from which we delete $v_0$ and adjacent edges. Since $c(M) \le c(V)$,
it follows that there are no edges between vertices in $A_0$ and $B_0$, $c(A_0), c(B_0) \le 2c(V)/3$,
and $|C_0| \le 2\sqrt{N}$. Let $C = C_0 \cup R_l \cup R_h$. It follows that $|C| \le 4\sqrt{N}$.

Each of the four sets $A_0$, $B_0$, $L$, and $H$ has cost at most $2c(V)/3$. If any one of them
has cost more than $c(V)/3$, let it be $A$ and let $B$ be the union of the remaining sets. If none
of them has cost more than $c(V)/3$, order the sets by size and let $A$ be the union of
the fewest of these sets whose cost is at least $c(V)/3$. This procedure insures that $A$
has cost between $c(V)/3$ and $2c(V)/3$, which implies that $B$ satisfies the same condition as
$A$, and the theorem is established. ■
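The final grouping step of the proof can be sketched as follows. This is one possible reading, stated as an assumption: "order the sets by size" is interpreted as ordering by decreasing cost, which yields the fewest sets whose combined cost reaches $c(V)/3$. The function name `split_sets` and the set labels are hypothetical.

```python
def split_sets(costs: dict, total) -> tuple:
    """Group the sets A_0, B_0, L, H into A and B.

    costs maps a set name to its cost; total is c(V). Each cost is assumed
    to be at most 2 * total / 3, as guaranteed by the proof.
    """
    # Largest cost first; names break ties so the result is deterministic.
    order = sorted(costs, key=lambda name: (-costs[name], name))
    chosen, acc = [], 0
    for name in order:
        chosen.append(name)
        acc += costs[name]
        if 3 * acc >= total:      # stop once A has cost at least c(V) / 3
            break
    rest = [name for name in order if name not in chosen]
    return chosen, rest


A, B = split_sets({"A0": 3, "B0": 2, "L": 2, "H": 2}, total=10)
# A == ["A0", "B0"] with cost 5, B == ["H", "L"] with cost 4; both are <= 20/3.
```

If one set alone has cost above $c(V)/3$ the loop stops after its first iteration, which matches the first rule in the proof.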

The preceding version of the planar separator theorem only guarantees that the vertices of a 
planar graph are divided into two sets whose costs are nearly balanced and a small separator. It 
does not insure that the number of vertices in the two sets are balanced. The following lemma 
remedies this situation. We leave its proof to the reader. (See Problem 12.10.) 

LEMMA 12.6.3 Let $G = (V, E)$ be an $N$-vertex planar graph having non-negative vertex costs
summing to $c(V)$. Then $V$ can be partitioned into three sets, $A$, $B$, and $C$, such that no edge joins
vertices in $A$ with those in $B$, neither $A$ nor $B$ has cost exceeding $7c(V)/9$, $|A|, |B| \le 5N/6$,
and $C$ contains no more than $K_1\sqrt{N}$ vertices, where $K_1 = 4(\sqrt{2/3} + 1)$.

This new result can be applied to show that the vertices of a planar graph can be partitioned 
into many sets each having about the same cost and such that a small set of vertices can be 
removed to separate each set from all other sets. This result is also left to the reader. (See 
Problem 12.11.) 

LEMMA 12.6.4 Let $G = (V, E)$ be an $N$-vertex planar graph and let $c$ be a non-negative cost
function on $V$ with total cost of $c(V)$. Let $P \ge 2$. There are constants $q$, $2P/3 \le q \le 3P$, and
$K_2 = 4(\sqrt{2/3} + 1)/(1 - \sqrt{5/6})$ such that $V$ can be partitioned into $q$ sets, $A_1, A_2, \ldots, A_q$,
such that for $1 \le i \le q$

$$c(V)/(3P) \le c(A_i) \le 3c(V)/(2P)$$

and there are sets $C_i$, $|C_i| \le K_2\sqrt{N}$, and $B_i = V - A_i - C_i$ such that no edges join vertices in
$A_i$ with vertices in $B_i$.
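The constants in Lemmas 12.6.3 and 12.6.4 are explicit, so their size can be checked numerically. The sketch below is illustrative; the helper `separator_bound` is a hypothetical name.

```python
from math import sqrt

K1 = 4 * (sqrt(2 / 3) + 1)          # about 7.27
K2 = K1 / (1 - sqrt(5 / 6))         # about 83.4


def separator_bound(num_vertices: int, constant: float = K2) -> float:
    """Upper bound constant * sqrt(N) on the size of each separator C_i."""
    return constant * sqrt(num_vertices)


# The bound is informative only once it drops below N, that is, for N > K2**2.
assert separator_bound(10_000) < 10_000
assert separator_bound(5_000) > 5_000
```

Because $K_2^2$ is close to 7,000, the lemma says nothing useful about small graphs; its value lies in the $\sqrt{N}$ growth rate.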

12.7 The Performance of VLSI Algorithms 

Using Theorem 12.6.1 and Lemma 12.6.4, we now derive lower bounds on $AT^2$ and $A^2T$
for individual functions by deriving lower bounds on their planar circuit size. In the following
section we derive lower bounds to the planar circuit size for multi-output functions using the
$w(u, v)$-flow property of these functions. In Section 12.7.2 we set the stage for deriving lower
bounds on the planar circuit size of predicates.

12.7.1 The Performance of VLSI Algorithms on Functions

The $w(u, v)$-flow property of functions is introduced in Section 10.4.1 and applied to the
study of space-time tradeoffs in the pebble game. In this section we use this property to derive
lower bounds on the semellective planar circuit size of multi-output functions.

DEFINITION 12.7.1 A function $f : X^n \mapsto X^m$ has a $w(u, v)$-flow if for all subsets $U_1$ and
$V_1$ of its $n$ input and $m$ output variables with $|U_1| \ge u$ and $|V_1| \ge v$ there is some assignment
to variables not in $U_1$ (variables in $U_0$) such that the resulting subfunction $h$ of $f$ that maps input
variables in $U_1$ to output variables in $V_1$ (the other outputs are discarded) has at least $|X|^{w(u,v)}$
points in the image of its domain. (Note that $w(u, v) \ge 0$.)

A lower bound on planar circuit size of a function $f$ is now derived from its $w(u, v)$-flow
property. For some functions the parameter $P$ will need to be large for $w(u, v) > 0$, as is seen
in Lemma 12.7.1.



THEOREM 12.7.1 Let $f : X^n \mapsto X^m$ have a $w(u, v)$-flow. Then its semellective planar circuit
size must satisfy the following lower bound for $u \ge n(1 - 3/2P)$, $v \ge m/(3P)$, and $P \ge 2$,
where $K_2 = 4(\sqrt{2/3} + 1)/(1 - \sqrt{5/6})$:

$$C_{p,\Omega}(f) \ge \frac{w^2(u, v)}{4K_2^2}$$

Proof Consider a minimal semellective planar circuit for $f : X^n \mapsto X^m$ on $n$ inputs
containing $N = C_{p,\Omega}(f)$ inputs, gates, and crossings. We apply the version of the planar
separator theorem given in Lemma 12.6.4 to this circuit by assigning unit weight to each input
vertex and zero weight to all other vertices. For any integer $P \le |V|$ we conclude that the
inputs, gates, and crossings of this circuit can be partitioned into $q$ sets $\{A_1, A_2, \ldots, A_q\}$,
for $2P/3 \le q \le 3P$, such that each set has at least $n/(3P)$ and at most $3n/(2P)$ input
vertices. Since the average number of output vertices in these sets is $m/q$, at least one set,
call it $A_1$, has at least the average number of output vertices, or at least $m/3P$ vertices. Let $U_0$ and
$V_1$ be the sets of inputs and outputs in $A_1$, respectively. Then, $n/(3P) \le |U_0| \le 3n/(2P)$
and $|V_1| \ge m/3P$.

For some assignment of values to variables in $U_0$, there are at least $|X|^{w(u,v)}$ values for
the outputs in $V_1$ when $u = n - |U_0| \ge n(1 - 3/2P)$ and $v = |V_1| \ge m/(3P)$. But
all of the values assumed by the outputs in $V_1$ must be assumed by the inputs, gates, and
crossing wires of the separator. Since at most two wires cross, a separator $C$ of size $|C|$ has
at most $2|C|$ inputs, gates, and wires, each of which can have at most $|X|$ values. Thus,
if $C_1$, the separator for $A_1$, has a size satisfying $2|C_1| < w(u, v)$, a contradiction results
and the output variables in $V_1$ cannot assume $|X|^{w(u,v)}$ values. It follows that $|C_1| \ge
w(u, v)/2$. Since $|C_1| \le K_2\sqrt{N}$, this implies that $N \ge w^2(u, v)/(2K_2)^2$, the desired
conclusion. ■

We apply this general result to $(\alpha, n, m, p)$-independent functions and matrix multiplication.
A function is $(\alpha, n, m, p)$-independent (see Definition 10.4.2) if it has a $w(u, v)$-flow
satisfying $w(u, v) > (v/\alpha) - 1$ for $n - u + v \le p$, where $n - u \ge 0$.

LEMMA 12.7.1 Let $f : X^n \mapsto X^m$ be $(\alpha, n, m, p)$-independent. Then for $P \ge (m/3 +
3n/2)/p$ and $m \ge 2\alpha$, $f$ has semellective planar circuit size satisfying the following lower bound:

$$C_{p,\Omega}(f) \ge \frac{m^2}{144(\alpha P)^2 K_2^2}$$

Proof $f$ has a $w(u, v)$-flow satisfying $w(u, v) > (v/\alpha) - 1$ for $n - u + v \le p$. When
$u \ge n(1 - 3/2P)$, $n - u + v \le p$ is satisfied if $v \le p - 3n/(2P)$. Since we also require
that $v \ge m/(3P)$, this implies that $P \ge (m/3 + 3n/2)/p$. Also, $v/\alpha - 1 \ge v/2\alpha$ if
$v \ge 2\alpha$. Substituting $m/3P$ for $v$, we have the desired conclusion. ■
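The parameter choice in Lemma 12.7.1 can be made concrete. The sketch below is illustrative only: the function names are hypothetical, $P$ is taken to be the smallest admissible integer, and the check `m >= 6 * alpha * P` encodes the proof's requirement $v = m/(3P) \ge 2\alpha$.

```python
from fractions import Fraction
from math import ceil, sqrt

K2 = 4 * (sqrt(2 / 3) + 1) / (1 - sqrt(5 / 6))


def smallest_P(n: int, m: int, p: int) -> int:
    """Smallest integer P >= 2 satisfying P >= (m/3 + 3n/2) / p."""
    return max(2, ceil((Fraction(m, 3) + Fraction(3 * n, 2)) / Fraction(p)))


def lemma_lower_bound(alpha: int, m: int, P: int) -> float:
    """Evaluate m^2 / (144 (alpha P)^2 K2^2)."""
    if m < 6 * alpha * P:
        raise ValueError("need v = m / (3P) >= 2 * alpha")
    return m ** 2 / (144 * (alpha * P) ** 2 * K2 ** 2)


# Example: a function with n = m = 1024 that is (2, n, m, n/2)-independent.
P = smallest_P(n=1024, m=1024, p=512)
bound = lemma_lower_bound(alpha=2, m=1024, P=P)
```

With $P$ fixed, doubling $m$ quadruples the bound, which is the quadratic growth used in the results that follow; the small leading constant shows that the bound is asymptotic in nature.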

In Section 10.5 we have shown that many functions are $(\alpha, n, m, p)$-independent. We
summarize these results below.

Name                      Function                                                                          Independence Property

Wrapped convolution       $f^{(n)}_{\mathrm{wrapped}} : \mathcal{R}^{2n} \mapsto \mathcal{R}^n$             $(2, 2n, n, n/2)$

Cyclic shift              $f^{(n)}_{\mathrm{cyclic}} : B^{n+\lceil \log n \rceil} \mapsto B^n$              $(2, n + \lceil \log n \rceil, n, n/2)$

Integer multiplication    $f^{(n)}_{\mathrm{mult}} : B^{2n} \mapsto B^{2n}$                                 $(2, 2n, n, n/2)$

n-point DFT               $F_n : \mathcal{R}^n \mapsto \mathcal{R}^n$                                       $(2, n, n, n/2)$

It follows that for each case Lemma 12.7.1 holds when $P \le m/(6\alpha)$. Thus, each of these
$(\alpha, n, m, p)$-independent functions has a planar circuit size that is at least quadratic in $n$, its
number of inputs. The following theorem results from this observation and Theorem 12.6.1.

THEOREM 12.7.2 The area $A$ and time $T$ required to compute $f^{(n)}_{\mathrm{wrapped}} : \mathcal{R}^{2n} \mapsto \mathcal{R}^n$,
$f^{(n)}_{\mathrm{cyclic}} : B^{n+\lceil \log n \rceil} \mapsto B^n$, $f^{(n)}_{\mathrm{mult}} : B^{2n} \mapsto B^{2n}$, and $F_n : \mathcal{R}^n \mapsto \mathcal{R}^n$ on a
semellective VLSI chip satisfy the following bounds:

$$AT^2, A^2T = \Omega(n^2)$$

The $AT^2$ lower bound can be achieved up to a constant multiplicative factor for each of these
functions for $\Omega(\log n) \le T \le \sqrt{n}$.

Proof From Theorem 12.5.1 we know that any fully normal algorithm can achieve the bound
$AT^2 = O(n^2)$ for $\Omega(\log n) \le T = O(\sqrt{n})$ on an embedded CCC network. Since cyclic
shift and FFT are shown to be fully normal (see Section 7.7), we have matching upper and
lower bounds for them. From Problem 12.13 we have that the wrapped convolution can
be realized with matching bounds on $AT^2$ over the same range of values for $T$. The same
statement applies to integer multiplication (see Problem 12.16). ■

In Section 12.6.1 we said that we would exhibit a function whose planar circuit size is
nearly quadratic in its standard circuit size. This property holds for the cyclic shifting function
because, as shown in Section 2.5.2, $f^{(n)}_{\mathrm{cyclic}} : B^{n+\lceil \log n \rceil} \mapsto B^n$ has circuit size no larger than
$O(n \log n)$, whereas from the above its planar circuit size is $\Theta(n^2)$.

The cyclic shift function is also an example of a function for which most of the chip area
is occupied by wires when $T = O(\sqrt{n/\log n})$, because in this case the area is $\Omega(n \log n)$ but
the number of gates needed to realize it is $O(n \log n)$.

Lower bounds on $AT^2$ and $A^2T$ also exist for matrix multiplication. From Lemma 10.5.3
we know that the matrix multiplication function $f^{(n)}_{A \times B} : \mathcal{R}^{2n^2} \mapsto \mathcal{R}^{n^2}$ has a $w(u, v)$-flow,
where $w(u, v) \ge (v - (2n^2 - u)^2/4n^2)/2$. Using this we have the following lower bound on
the planar circuit size of this function.

THEOREM 12.7.3 The area $A$ and time $T$ required to compute the matrix multiplication function
$f^{(n)}_{A \times B} : \mathcal{R}^{2n^2} \mapsto \mathcal{R}^{n^2}$ with a semellective VLSI algorithm satisfy the following lower bound:

$$AT^2, A^2T = \Omega(n^4)$$

The $AT^2$ lower bound can be met to within a constant multiplicative factor.

Proof Apply Theorem 12.7.1 to matrix multiplication by replacing the number of input
variables $n$ by $2n^2$ and the number of output variables $m$ by $n^2$. With $u \ge 2n^2(1 - 3/2P)$
and $v \ge n^2/(3P)$, the $w(u, v)$-flow function has value

$$w(u, v) \ge (v - (2n^2 - u)^2/4n^2)/2 \ge \frac{n^2}{2}\left(\frac{1}{3P} - \frac{9}{4P^2}\right)$$

The right-hand side is maximized when $P = 14$ and has value greater than $n^2/163$, from
which the conclusion follows.
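The choice $P = 14$ and the constant 163 can be verified with exact arithmetic. The sketch below assumes the boundary values $u = 2n^2(1 - 3/2P)$ and $v = n^2/(3P)$ from Theorem 12.7.1, so that $(2n^2 - u)^2/4n^2 = 9n^2/(4P^2)$; the function name `flow_coefficient` is hypothetical.

```python
from fractions import Fraction


def flow_coefficient(P: int) -> Fraction:
    """Coefficient of n^2 in (v - (2n^2 - u)^2 / (4n^2)) / 2 at the boundary values."""
    return (Fraction(1, 3 * P) - Fraction(9, 4 * P * P)) / 2


# Search the integers P >= 2 for the largest coefficient.
best = max(range(2, 200), key=flow_coefficient)
assert best == 14
assert flow_coefficient(14) == Fraction(29, 4704)
assert flow_coefficient(14) > Fraction(1, 163)
```

Over the reals the maximum is at $P = 27/2$, and 14 is the better of its two integer neighbors. Squaring the resulting bound $w(u, v) > n^2/163$ in Theorem 12.7.1 gives a planar circuit size of $\Omega(n^4)$.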

As shown in Section 7.5.3, two $n \times n$ matrices can be multiplied with area $A = O(n^2)$ and
time $T = n$, which meets the lower bound up to a multiplicative factor. Other near-optimal
solutions also exist. (See Problem 12.15.) ■

12.7.2 The Performance of VLSI Algorithms on Predicates 

The approach taken above can be extended to predicates, functions whose range is B. Again 
we derive lower bounds on the size of the smallest planar circuit for a function. However, since 
the flow of information from inputs to outputs is at most one bit, we must find some other 
way to measure the amount of information that must be exchanged between the two halves 
of a planar circuit. An extension of the communication complexity measure introduced in 
Section 9.7.1 serves this purpose. 

The communication complexity measure of Section 9.7.1 assumes that two players exchange
bits to compute the value of a Boolean function $f : B^n \mapsto B$. The input variables
of $f$ are partitioned into two sets $U$ and $V$ and assigned to two players. Given this partition,
the players choose a protocol (a scheme for alternating the transmission of bits from one to
the other) by which to decide the value of $f$ for every input $n$-tuple of $f$. The bits of each
$n$-tuple are partitioned between the two players according to the division of the $n$ input
variables between the sets $U$ and $V$. The players then use their protocol to determine the value of
$f$. The communication complexity $C(U, V)$ of this game is the minimum over protocols of
the maximum over input $n$-tuples of the number of bits exchanged by the players to compute
$f$ given the partition of the input variables into sets $U$ and $V$. This measure and its associated
game are naturally extended to predicates $f : X^n \mapsto B$, whose variables assume values over
the set $X$. Players now exchange values drawn from the set $X$.

We can derive a lower bound on planar circuit size by applying the planar separator theorem.
Since this theorem partitions the input variables into three sets, $A$, $B$, and a separator $C$,
where $A$ and $B$ contain at most two-thirds of the total number of input vertices, it is natural
to extend the standard communication complexity measure to the following VLSI communication
complexity measure for functions $f : X^n \mapsto B$.

DEFINITION 12.7.2 The VLSI communication complexity of a predicate $f : X^n \mapsto B$,
$CC_{\mathrm{vlsi}}(f)$, is the minimum of the communication complexity $C(U, V)$ over all partitions $(U, V)$
of the variables of $f$ into two sets of size at most $2n/3$.

The following theorem, which is left as an exercise (see Problem 12.17), summarizes the
result of applying the VLSI communication complexity measure $CC_{\mathrm{vlsi}}(f)$ together with the
planar separator theorem to derive a lower bound on the semellective planar circuit size of
predicates.

THEOREM 12.7.4 Let $f : X^n \mapsto B$ have VLSI communication complexity $CC_{\mathrm{vlsi}}(f)$. Then,
the following bounds hold for the computation of $f$ by a semellective VLSI chip with area $A$ in $T$
steps.

$$(CC_{\mathrm{vlsi}}(f))^2 = O(AT^2), O(A^2T)$$

Note that in a planar circuit all the information passed from each side of the separator
to the other is sent simultaneously, whereas in the communication game players alternate in
sending values drawn from the set $X$. Because more freedom is granted to players in the
communication game (each player can choose data to send based on responses previously received
from the other player), a lower bound on communication complexity is a lower bound on the
amount of information that must be exchanged in a planar circuit.

A number of techniques have been developed to derive lower bounds on the planar circuit
size of predicates. One of these uses the pigeonhole principle (also known as a crossing-sequence
argument) to derive lower bounds for predicates that are $w(u, v)$-separated. This
new property is similar to the $w(u, v)$-flow property of multi-output functions. It is defined
below.



DEFINITION 12.7.3 A function $f : X^n \mapsto B$ is $w(u, v)$-separated if its variables can be
permuted and partitioned into three sets $U$, $V$, and $Z$, $|U| \ge u$ and $|V| \ge v$, such that there is some
value $z$ for variables in $Z$ and values $u_i$ and $v_i$, $1 \le i \le |X|^{w(u,v)}$, for variables in $U$ and $V$,
respectively, such that the following holds:

$$f(u_i, v_j, z) = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{otherwise} \end{cases}$$
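The separation condition can be tested mechanically for a candidate family of values. The sketch below is an illustration that is not taken from the text: it uses the equality predicate on two $k$-bit strings with $Z$ empty, and the names `is_separating_family` and `equality` are hypothetical.

```python
from itertools import product


def is_separating_family(f, us, vs, z) -> bool:
    """Check that f(u_i, v_j, z) is 1 exactly when i == j."""
    return all(
        f(u, v, z) == (1 if i == j else 0)
        for i, u in enumerate(us)
        for j, v in enumerate(vs)
    )


def equality(u, v, z) -> int:
    """Predicate on 2k Boolean variables: 1 if the two k-bit halves agree."""
    return 1 if u == v else 0


k = 3
strings = list(product((0, 1), repeat=k))   # 2^k values for U and for V
assert is_separating_family(equality, strings, strings, ())
```

Since the family has $2^k = |B|^k$ members, this witnesses that the equality predicate is $w(u, v)$-separated with $w(k, k) = k$ under the partition used here.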



This definition can be applied to predicates that are associated with multi-output functions. 
These functions are defined below. 

DEFINITION 12.7.4 The characteristic predicate p f : X ( - n+m '> \->Boff: jW i-> X^ is 
defined below. 



Pf(x,y) 



i ify = f(x) 

otherwise 



It is straightforward to show that the characteristic predicate of a function that has a 
w(u, v)-flow is w(u, v)-separated. (See Problem 12.18.) As a consequence, quadratic lower 
bounds exist on the semellective planar circuit size of the characteristic predicates of the con- 
volution, cyclic shift, integer multiplication, discrete Fourier transform, and matrix multiplication 
functions, among many others. 
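The link between a characteristic predicate and the separated property can be seen on a toy case. The sketch below is only an illustration under stated assumptions: the function `cyclic_shift_by_one` and the helper `characteristic_predicate` are hypothetical names chosen here, the alphabet is {0, 1}, and the set Z is empty. Because the toy function is a bijection on 3-bit inputs, pairing each input with the output it produces yields a table that is 1 exactly on the diagonal, which is the pattern the definition of a separated predicate asks for.

```python
from itertools import product


def characteristic_predicate(f):
    """Return p_f, where p_f(x, y) = 1 if y == f(x) and 0 otherwise."""
    return lambda x, y: 1 if f(x) == y else 0


def cyclic_shift_by_one(x):
    """Hypothetical toy function: rotate a bit tuple right by one place."""
    return x[-1:] + x[:-1]


p = characteristic_predicate(cyclic_shift_by_one)

# Candidate values u_i for the variables in U: all 3-bit inputs.
inputs = list(product((0, 1), repeat=3))
# Candidate values v_j for the variables in V: the output that each u_j produces.
outputs = [cyclic_shift_by_one(x) for x in inputs]

# table[i][j] = p_f(u_i, v_j); it is 1 exactly when i == j for this bijection.
table = [[p(x, y) for y in outputs] for x in inputs]

for row in table:
    print(row)
```

Here the table has 8 = 2^3 rows, so under these assumptions the toy predicate behaves like a w-separated predicate with w = 3.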

12.8 Area Bounds 

We now derive lower bounds on the area used by semellective VLSI chip algorithms for a 
variety of functions. For the functions considered here, these bounds are linear in their number 
of variables. As explained in the Chapter Notes, not all functions are amenable to the type of 
analysis presented in this section. 

The technique used to derive area lower bounds is similar to that used in Section 10.4.2 
to derive lower bounds on the exchange of space for time in the pebble game. If a chip has 
many I/O ports, it has large area. On the other hand, if it has a small number of ports, the 
inputs to the function computed are received over many cycles. If the function has a large 
w(u, v)-flow, by direct analogy with the pebble game, the area must be large to ensure that 
enough information is stored between cycles. 

THEOREM 12.8.1 Let $\beta > 1$. If $f : X^n \mapsto X^m$ has a w(u, v)-flow, every chip computing f 
requires area $A = \Omega(\min((m/2\beta), w(u, v)))$, where $u = n(1 - 1/\beta)$ and $v = (m/4\beta)$. 

Proof If the chip has $\pi$ I/O pads or can store S values over the alphabet X, it has area 
$A \geq \lambda^2 \min(\pi, S)$. Fix $\beta > 1$. Its value is chosen later to provide a strong lower bound. If 
$\pi > m/2\beta$, we are done. Thus, we show that $S \geq w(u, v)$ when $\pi \leq m/2\beta$. 

Let the VLSI algorithm have T time steps and let $h_i \leq \pi$ outputs be generated on the 
ith time step, $1 \leq i \leq T$. Create q intervals of consecutive time steps as follows: The first 
interval contains the first $k_1$ time steps, where $k_1$ is such that the total number of outputs 
produced during the first $k_1$ steps is as large as possible without exceeding $m/\beta$. Successive 
intervals are created in the same way, namely by grouping consecutive later time steps to 
satisfy the same requirement on the number of outputs produced. For all intervals except 
possibly the last, the number of outputs produced is at least $(m/\beta) - \pi + 1 \geq (m/2\beta)$. 
If the last interval contains fewer than $(m/2\beta)$ outputs, redistribute the elements in the last 
two intervals, of which there are at least $(m/\beta) - \pi + 2 \geq (m/2\beta) + 2$, so that each has at 
least $(m/4\beta) + 1$ outputs. It follows that the number of intervals, q, satisfies $\beta \leq q \leq 4\beta$. 

We now examine the inputs read during intervals. Since there are n inputs to be read 
and each is read once, the average number read per interval is $n/q$, which is at most $n/\beta$. It 
follows that there is some interval I in which at least $(m/4\beta) + 1$ outputs are produced and 
at most $n/\beta$ inputs are read. 

Fix the inputs that are read during I. The remaining inputs, of which there are at least 
$u = n(1 - 1/\beta)$, are free to vary. The number of outputs produced during I is at least 
$v = (m/4\beta)$. Since f has a w(u, v)-flow, if $S < w(u, v)$, the v outputs, whose values are 
determined by the values stored on the chip at the beginning of I, cannot assume all their 
values. It follows that $S \geq w(u, v)$, which is the desired conclusion. ■ 
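The greedy grouping of time steps used in the proof can be sketched in a few lines. This is a minimal illustration, not the book's construction in full: the function names, the sample schedule, and the values of m, β, and the port count are assumptions made here, and the final redistribution of the last two intervals is left out. Each interval is closed as soon as adding the next time step would push its output count above m/β.

```python
def group_into_intervals(outputs_per_step, m, beta):
    """Greedily group consecutive time steps so each interval emits at most m/beta outputs.

    Assumes every single step emits at most m/beta outputs.
    """
    cap = m / beta
    intervals, current, total = [], [], 0
    for step, count in enumerate(outputs_per_step):
        # Close the current interval if this step would exceed the cap.
        if current and total + count > cap:
            intervals.append(current)
            current, total = [], 0
        current.append(step)
        total += count
    if current:
        intervals.append(current)
    return intervals


def outputs_in(interval, outputs_per_step):
    """Total number of outputs produced during the given interval."""
    return sum(outputs_per_step[t] for t in interval)


# Hypothetical parameters: 3 ports, 60 outputs, beta = 4, every step emits 3 outputs.
ports, m, beta = 3, 60, 4
schedule = [ports] * (m // ports)

intervals = group_into_intervals(schedule, m, beta)
counts = [outputs_in(iv, schedule) for iv in intervals]
print(len(intervals), counts)
```

For this schedule every interval except possibly the last holds at least (m/β) - 3 + 1 outputs, and the number of intervals lies between β and 4β, matching the counts claimed in the proof.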

We now apply this bound to $(\alpha, n, m, p)$-independent functions. Later we apply it to the 
matrix multiplication function. 

THEOREM 12.8.2 Let $f : X^n \mapsto X^m$ be $(\alpha, n, m, p)$-independent. It requires area 
$A = \Omega((mp/((n + m/4)\alpha)) - 1)$ when realized by a semellective VLSI algorithm. 
Proof We apply Theorem 12.8.1 with $u = n(1 - 1/\beta)$ and $v = (m/4\beta)$. Because f is 
$(\alpha, n, m, p)$-independent, $w(u, v) \geq v/\alpha - 1$ for $n - u + v \leq p$. Since $n - u = n/\beta$ and 
$v = (m/4\beta)$, this implies that $\beta \geq (n + m/4)/p$. The lower bound of Theorem 12.8.1 
then is the smaller of $(m/2\beta)$ and $(m/4\alpha\beta) - 1$. Since we are free to choose $\beta$, we choose 
it to make the smaller of the two as large as possible. In particular, we set $\beta = (n + m/4)/p$, 
which provides the desired result. ■ 
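To see how the two terms of the bound compare once β is fixed at (n + m/4)/p, the following sketch evaluates them exactly with rational arithmetic. It is an illustration only: the function name `area_bound_terms` and the sample parameter values are assumptions made here, not taken from the text, and constants hidden by the asymptotic notation are ignored.

```python
from fractions import Fraction


def area_bound_terms(alpha, n, m, p):
    """Evaluate beta = (n + m/4)/p and the two quantities whose minimum bounds the area.

    Returns (beta, m/(2*beta), m/(4*alpha*beta) - 1) as exact fractions.
    """
    beta = Fraction(4 * n + m, 4 * p)  # same value as (n + m/4)/p
    port_term = Fraction(m) / (2 * beta)
    flow_term = Fraction(m) / (4 * alpha * beta) - 1
    return beta, port_term, flow_term


# Hypothetical parameters with n, m, and p equal and alpha = 1.
beta, port_term, flow_term = area_bound_terms(alpha=1, n=64, m=64, p=64)
bound = min(port_term, flow_term)
print(beta, port_term, flow_term, bound)
```

With n, m, and p proportional to one another, both terms grow linearly in n, which is the behavior the theorem's bound expresses.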

Because all of the $(\alpha, n, m, p)$-independent functions listed in Theorem 12.7.2 have n, 
m, and p proportional to one another, each requires area $A = \Omega(n)$, as stated below. It 
follows that the lower bound $AT^2 = \Omega(n^2)$ for these problems cannot be achieved to within 
a constant multiplicative factor if T grows more rapidly with n than $\sqrt{n}$. 

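COROLLARY 12.8.1 The functions $f^{(n)}_{\mathrm{wrapped}} : \mathcal{R}^{2n} \mapsto \mathcal{R}^n$, $f^{(n)}_{\mathrm{cyclic}} : B^{n + \lceil \log n \rceil} \mapsto B^n$, 
$f^{(n)}_{\mathrm{mult}} : B^{2n} \mapsto B^{2n}$, and $F_n : \mathcal{R}^n \mapsto \mathcal{R}^n$ each require area $A = \Omega(n)$ when realized by a 
semellective VLSI algorithm. 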

A similar result applies to matrix multiplication. 

THEOREM 12.8.3 The area A required to compute the matrix multiplication function 
$f^{(n)}_{A \times B} : \mathcal{R}^{2n^2} \mapsto \mathcal{R}^{n^2}$ with a semellective VLSI algorithm satisfies $A = \Omega(n^2)$. 

Proof We apply Theorem 12.8.1 with n and m replaced by $2n^2$ and $n^2$, respectively. Since 
$u = 2n^2(1 - 1/\beta)$ and $v = (n^2/4\beta)$, the lower bound on the w(u, v)-flow for the matrix multi- 
plication function satisfies the following: 

$$w(u, v) = \bigl(v - (2n^2 - u)^2/4n^2\bigr)/2 \geq \frac{n^2}{2}\left(\frac{1}{4\beta} - \frac{1}{\beta^2}\right)$$

The lower bound is a positive multiple of $n^2$ if $\beta > 4$ and largest for $\beta = 8$, from which 
the desired conclusion follows. ■ 
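The choice of β in this proof can be checked numerically. Substituting u = 2n²(1 - 1/β) and v = n²/4β into (v - (2n² - u)²/4n²)/2 leaves n² times the coefficient (1/(4β) - 1/β²)/2. The sketch below, in which the function name `flow_coefficient` and the search range are assumptions made for illustration, evaluates that coefficient exactly and confirms where it vanishes and where it peaks over integer β.

```python
from fractions import Fraction


def flow_coefficient(beta):
    """Coefficient of n^2 in the flow lower bound: (1/(4*beta) - 1/beta^2) / 2."""
    beta = Fraction(beta)
    return (1 / (4 * beta) - 1 / beta**2) / 2


# Search integer values of beta in a hypothetical range for the largest coefficient.
best_beta = max(range(1, 65), key=flow_coefficient)

print(best_beta, flow_coefficient(best_beta))
print(flow_coefficient(4))
```

The coefficient is zero at β = 4, positive beyond it, and peaks at β = 8, where it equals 1/128, so the flow term contributes n²/128 under these substitutions.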



Problems 

VLSI COMPUTATIONAL MODELS 

12.1 Assume the I/O ports are on the periphery of a convex chip. In the speed-of-light model 
show that if p such ports all have paths to some point on the chip, then the time for 
data supplied to each port to reach that point is 0(p). 

12.2 Under the assumptions of Problem 12.1, derive a lower bound on the time to compute 
a function f on n inputs under the additional assumption that there is a path on the 
chip from the port at which each variable arrives to the port at which f is produced. 

Hint: Show that the time required is at least the sum of the number of cycles needed 
to read all n inputs and the time for data to travel across the chip. State these times in 
terms of p and choose p to maximize the smaller of these two lower bounds. 
CHIP LAYOUT 

12.3 Show that every layout of a balanced binary tree on n leaves in which the root and the 
leaves are placed on the boundary of a convex region has area proportional to n log n. 
Hint: Consider an inscribed quadrilateral defined by the longest chord and a chord 
perpendicular to it. 

12.4 The n x n mesh-of-trees network, $n = 2^r$, is described in Problem 7.4. Give an area- 
efficient layout for an arbitrary graph in this family of graphs and derive an expression 
for its area. 

12.5 Let n = 2 . As suggested in Fig. 12.9, the n x n tree of meshes T n is a binary tree 
in which each vertex is a mesh and the meshes are decreasing in size with distance from 
the root. The edges between vertices are bundles of parallel wires. The root vertex is 
an n x n mesh, its immediate descendants are n/2 x n meshes, and their immediate 
descendants are n/2 x n/2 meshes, and so on. 

The depth-d, n x n tree of meshes, $T_{n,d}$, is $T_n$ truncated to vertices at 
distance d or less from the root. 

Determine the area of an area-efficient layout of the tree $T_{n,d}$. 

COMPUTATIONAL INEQUALITIES 

12.6 Use the results of Problem 12.11 to extend Theorem 12.7.1 to multilective planar 
circuits of order $\mu$. 

12.7 Further extend the results of Problem 12.6 to $(\beta, \mu)$-multilective VLSI algorithms by 
showing that, at the expense of a small increase in $AT^2$ and $A^2T$, multiple inputs of a 
variable at the same I/O port can be treated as a single input, thereby possibly reducing 
the multilective order of the corresponding planar circuit. This implies that if multiple 
copies of each variable are read at a single port, then the semellective planar circuit size 
is a lower bound to both $AT^2$ and $A^2T$. 
Figure 12.9 The 4 x 4 tree of meshes, $T_4$. 

THE PLANAR SEPARATOR THEOREM 

12.8 The pizza pie graph G = (V, E) has n = |V| - 1 vertices that are uniformly spaced 
points on a circle as well as a vertex at the center of the circle. E consists of the arcs 
between vertices on the circle and edges between the central vertex and vertices on the 
circle. 

When n = 12, triangulate G by adding edges between vertices on its external face. 
Illustrate Lemma 12.6.2 by choosing a cost function c and constructing two sets whose 
cost is at most 2c(V)/3 and a separator containing at most three vertices. 

12.9 In a spanning tree for a graph G = (V, E) the level of a vertex is the length of the path 
from the root to it. Given a non-negative cost function on the vertices of G totaling 
c(V), show there is some level m such that the cost of vertices at levels less than and 
more than m each is at most c(V)/2. 

12.10 (Two-Cost Planar Separator Theorem) Let G = (V, E) be an N-vertex planar graph 
having non-negative vertex costs summing to c(V). Show that V can be partitioned 
into three sets, A, B, and C, such that no edge joins vertices in A with those in B, 
neither A nor B has cost exceeding 7c(V)/9, A and B each contain at most 5N/6 
vertices, and C contains no more than $K_1\sqrt{N}$ vertices, where $K_1 = 4(\sqrt{2/3} + 1)$. 
Hint: Apply the planar separator theorem twice. The first time use it to partition V 
into two sets of about the same size and a separator. If each of the two sets has cost 
at most 2c(V)/3, the result holds. If not, make a second application of the planar 
separator theorem to the set with larger cost. Show that it is possible to combine sets to 
simultaneously meet both the size and cost requirements. 

12.11 Let G = (V, E) be an N-vertex planar graph and let c be a non-negative cost function 
on V with total cost c(V). Let $P \geq 2$. Show there are constants $2P/3 \leq q \leq 3P$ and 
$K_2 = 4(\sqrt{2/3} + 1)/(1 - \sqrt{5/6})$ such that V can be partitioned into q sets, $A_1, A_2, 
\ldots, A_q$ such that for $1 \leq i \leq q$ 

$$c(V)/(3P) \leq c(A_i) \leq 3c(V)/(2P)$$

and there are sets $C_i$, $|C_i| \leq K_2\sqrt{N}$, and $B_i = V - A_i - C_i$ such that no edges join 
vertices in $A_i$ with vertices in $B_i$. 

Hint: When P = 2, use the result of Problem 12.10 and combine the vertices of the 
separator with the other two sets to satisfy the necessary conditions. When P > 2, 
subdivide any set with cost exceeding c(V)/P into two sets and a separator using the 
two-cost planar separator theorem. Assign vertices of the separator to these two sets to 
keep the cost in balance. 

THE PERFORMANCE OF VLSI ALGORITHMS 

12.12 Show that the function defined by the product of three square matrices has a semel- 
lective planar circuit size that is quadratic in its number of variables and that it can be 
realized by a VLSI chip with $AT^2$ that meets the semellective planar circuit size lower 
bound. 

12.13 Show that the wrapped convolution function $f^{(n)}_{\mathrm{wrapped}} : \mathcal{R}^{2n} \mapsto \mathcal{R}^n$ can be realized 
as an embedded CCC network on a VLSI circuit with area A and time T satisfying 
$AT^2 = \Theta(n^2)$ for $O(\log n) \leq T \leq \sqrt{n}$. 
12.14 Design a VLSI chip for n x n matrix multiplication that achieves AT 2 = n log n for 
T = O(logn). 

Hint: Represent each matrix as a 2 x 2 matrix of (n/2) x (n/2) matrices and use the 
standard algorithm that performs eight multiplications of {n/2) x (n/2) matrices. A 
multiplier has one side longer than the other. Place the long side of the (n/2) x (n/2) 
matrix multiplier at right angles to the long side of the n x n matrix multiplier. Apply 
this rule to the recursive construction of the multiplier. 

12.15 Show that an algorithm of the kind described in Problem 12.14 can be combined with 
a mesh-based matrix multiplication algorithm of the kind described in Section 7.5.3 to 
produce a family of algorithms that achieve the lower bound on n x n matrix multipli- 
cation for $\Omega(\log n) \leq T \leq n$. 

12.16 Devise a VLSI chip for the n-bit integer multiplication function that uses area A and 
time T efficiently. 

Hint: Let x and y denote binary numbers. Recursively form the product of these 
integers as the sum of two products, that of x with the high-order (n/2) bits of y and 
that of x with the low-order (n/2) bits of y. Use carry-save addition where possible. 

12.17 Give a proof of Theorem 12.7.4. 

12.18 Show that the characteristic predicate of a function that has a w(u, v)-flow is w(u, v)- 
separated. 

AREA BOUNDS 

12.19 Show that any VLSI algorithm that realizes a superconcentrator on n inputs requires 
area $\Omega(n)$. 



Chapter Notes 



Mead and Conway wrote an influential book [213] that greatly simplified the design rules for 
VLSI chips and made VLSI design accessible to a large audience. Ullman [339] summarized 
the status of the field around 1984 and Lengauer [193] addressed the VLSI layout problem. 
Lengauer has also written a survey paper [194] that provides an overview of the theory of VLSI 
algorithms as of about 1990. The three transmission models described in Section 12.2 reflect 
the analysis of Zhou, Preparata, and Khang [372]. 

Thompson [326] obtained the first important tradeoff results for the VLSI model of com- 
putation. He demonstrated that under a suitable model a lower bound of $AT^2 = \Omega(n^2)$ 
could be derived for the discrete Fourier transform, a result he subsequently extended to sort- 
ing [327]. Generalizations of this model were made to convex chips [59], compact plane 
regions [195], and other closely related models [202]. Vuillemin [355] extended the models 
to include pipelining. Chazelle and Monier [67] introduced the transmission-line model de- 
scribed in Problems 12.1 and 12.2. For a discussion of other models that take into account the 
effects of distributed resistance, capacitance and inductance, see [40] and [372]. 

Systolic algorithms, which make good use of area and time, were popularized by Kung 
[177] and others (see, for example, [104,122,179,180,181,190]). The H-tree featured in Sec- 
tion 12.5.1 is due to Mead and Rem [214]. Prefix computations are discussed in Chapter 2. 
The cube-connected cycles network (its layout is given in Section 12.5.3) and the efficient 
realization of normal algorithms are due to Preparata and Vuillemin [262], as explained in 
Chapter 7. Lengauer [193] provides an in-depth treatment of algorithms for VLSI chip lay- 
out. 

Most authors prefer to derive lower bounds on $AT^2$ by partitioning the planar region oc- 
cupied by chips [59,195,326]. In effect, they employ a physical version of the planar separator 
theorem. The characterization of VLSI lower bounds in terms of planar circuit complexity in- 
troduced by Savage [288] reinforces the connection between memoryless and memory-based 
computation explored in Chapter 3 but for planar computations by VLSI chips. It also pro- 
vides an opportunity to introduce the elegant planar separator theorem of Lipton and Tarjan 
[203]. Lipton and Tarjan [204] developed quadratic lower bounds on the planar circuit size of 
shifting and matrix multiplication before the connection was established between VLSI com- 
plexity and planar circuit size. Improving upon results of [288], McColl [209] and McColl and 
Paterson [210] show that almost all Boolean functions on n variables require a planar circuit 
size of $\Omega(2^n)$ and that this lower bound can be achieved for all functions to within a constant 
multiplicative factor close to 1. Turán [336] has shown that the upper bound of Lemma 12.6.1 
is tight by exhibiting a family of functions of linear standard circuit size whose planar circuit 
size is quadratic. 

Abelson [1] and Yao [366] studied communication complexity with fixed partitions. Yao 
[367] and Lipton and Sedgewick [202] made explicit the implicit connection between VLSI 
communication complexity and the derivation of the $AT^2$ lower bounds. (See also [236], 
[12], and [194] for a discussion of the conditions under which lower bounds can be derived 
on the VLSI communication complexity measure.) 

Many authors have contributed to the derivation of semellective lower bounds for partic- 
ular functions. Among these are Thompson [326,327,328,329], who obtained bounds of the 
form $AT^2 = \Omega(n^2)$ for the DFT and sorting, as did Abelson and Andreae [3] and Brent 
and Kung [59] for integer multiplication, Jaja and Kumar [149] for a variety of problems, Bi- 
lardi and Preparata [41] for sorting, Savage for matrix multiplication, inversion, and transitive 
closure [289] and binary integer powers and reciprocals [288], and Vuillemin for transitive 
functions [355] (see Problem 10.22). These authors generally show that the lower bounds for 
functions can be met to within a small multiplicative constant factor. 

Good VLSI designs have been given by Baudet, Preparata, and Vuillemin [31] for con- 
volution, Guibas and Liang [123] for systolic stacks, queues, and counters, and Kung and 
Song [183] and Kung, Ruane, and Yen [182] on 2D convolution. Also, Luk and Vuillemin 
[207] give an optimal VLSI integer multiplier and Mehlhorn has provided optimal algorithms 
for integer division and square rooting [217] whose range of optimality has been extended 
by Mehlhorn and Preparata [219]. Preparata [258] has given a mesh-based optimal VLSI 
multiplier for large integers and Preparata and Vuillemin have given optimal algorithms for 
multiplying square [260] and triangular matrices [261]. C. Savage [284] has given a systolic 
algorithm for graph connectivity. 

Lower bounds for the semellective computation of predicates by VLSI algorithms have 
been derived by Yao [367] for graph isomorphism, by Lipton and Sedgewick [202] for the 
recognition of context-free languages, pattern matching, and binary integer factorization test- 
ing, and by Savage [288] for the characteristic predicates of multi-output functions. 

Hochschild [134], Kedem and Zorat [163,164], Savage [290,291], and Turán [337] have 
developed lower bounds on the performance of multilective VLSI algorithms. Savage has explored 
multilective planar circuit size [291], giving a multi-output function with an $\Omega(n^{4/3})$ lower 
bound. Turán [337] exhibits a function and a predicate with $\Omega(n^{3/2} \log n)$ and $\Omega(n \log n)$ 
lower bounds to their multilective planar circuit size, respectively. The w(u, v)-flow and 
w(u, v)-separated properties used in Section 12.7 were introduced in [291]. 

Lower bounds on the area of chips have been explored by a number of authors. Yao [367] 
examined addition; Baudet [30] studied functions that do not have a large information flow; 
Heintz [131] derived bounds for matrix-matrix multiplication; Leighton [191] introduced and 
used the crossing number of a graph to derive area bounds; Siegel [309] derived bounds for 
sorting; and Savage [288] examined functions with many subfunctions. Bilardi and Preparata 
[42] have generalized arguments of [30] and [152] to derive stronger area-time lower bounds 
for functions, such as prefix, for which the information flow arguments give weak results. 
Lower bounds on the area of multilective chips were obtained by Savage [291], Hromkovic 
[142,143], and Duris and Galil [93]. 

Models for 3D VLSI chips, which are not yet a reality, have been introduced by Rosenberg 
[282,283] and studied by Preparata [263]. 
Chomsky, 

hierarchy 5 182 
normal form, CFLs 187 
Church-Turing thesis 209 
circuit(s), 
addition function 60 
(chapter) 237(*) 

LDL factorization of SPD matrices 258 
approximator 426 427 429 430 
basis 239 
decoder 54 

demultiplexer function 55 
binary multiplication function 66 

decoder 54 
monotone communication game 
FFT, algorithm 267 
full-adder 18 
functions, 
integrated 575 
Boolean matrix multiplication on 423 
RAM, next-state/output functions 120 
transitive 182 248(*) 

application to parsing CFLs 190 
circuits for 251 252 
function, complexity of 249 
matrix 248 

reduction of matrix multiplication 
function to 250 
cmdllO 

coarse-grained parallel computers 284 
Cobham, A. 526 608 
Cocke, J. 207 
color player 442 

combinatorial circuits, (chapter) 237(*) 
combiners 406 

COMMON model, PRAM 313 
communication, 
complexity 437 

depth relationship 438 439 
monotone depth relationship 440(*) 
of clique function 447 
VLSI 595 
game 437 441 447 
parity communication problem 438 
commutativity 40 

non, matrix multiplication 242 
rings 239 264(*) 
comparator, 

based, merging networks 481 
circuit 35 
element 481 
function 270 
networks 270 271 
compare-exchange operation 560 



competitive, 

analysis 567 

memory management 567(*) 
compiler 187 

complementary number system 432 
complementation, 

CFL, not closed under 199 
complements, 

complexity class 343(*) 
decision problem 329 
language 170 

decision problems and 329(*) 
NP 347(*) 
Schur 254 
complete, 
basis 84 392 

formula size relationship 399 
language 130(*) 

NP 130 

P 130 
problems 350(*), 351 
records 339 
complexity, 
circuit 27(*) 

bounded-fanout 395 

(chapter) 391(*) 

depth 436(*) 

measures of 40(*), 393(*) 

measures, relationships among 394(*) 

relationship to TM computation time 5 
classes 26(*), 334 

(chapter) 327(*) 

circuit, containment of 381 

circuits 380(*) 

complements of 343(*) 

relationships among 342 

space-bounded 338(*) 

time-bounded 337(*) 

time-bounded and space-bounded 
relationships 341(*) 

time-bounded, containment among 337 

transformation class relationships with 
350 
communication 437 

depth relationship 439 

general depth and 438(*) 

monotone depth relationship 440(*) 

of clique function 447 

VLSI 595 
complex instruction set computer, See CISC 
computational 23(*) 
complexity (cont.) 

computational (cont.) 

brief history 5 
I/O 563 

brief history 6 
measures, size of smallest circuit for a 

function 118 
theory, P and NP-complete language role 

128 
TM vs circuit size, as tool to resolve P = NP 

equality question 128 
transitive closure function 249 
composition, 
function 231 
log-space TM 351 
computability, 
(chapter) 209 

feasible problems, serial computation thesis 
330 
computation, 

bounded, impossibility theorem for 24 

capacity 532 

circuit, 

equivalence between FSM and 96 

model, logic circuits as 16(*) 

reductions of TM computations to 128(*) 
cost, with HMM 563 
data-dependent, branching programs 488 
function, by standard TM 210 
locality of reference 558 
multilective 580 
on a branching program 489 
parallel 27(*) 

(chapter) 281 

circuit models 372 
period 582 
prefix 55(*), 583(*) 
read-once 580 

restricted models of, representing 217(*) 
semellective 580 
serial, thesis 330 
step, red-blue pebble game 530 
time, 

in the VLSI synchronous model 579 

pebbling strategy 531 
computational, 
complexity 23(*) 

brief history 5 
inequalities 23(*) 

for FSM 95(*) 

for interconnected FSMs 97 



computational (cont.) 
inequalities (cont.) 

for the random-access memory 117(*) 
for the TM 127 134 127(*) 
RAM 118 
VLSI chips 587(*) 
models 16(*) 

branching program comparison 

with 493(*) 
parallel 282(*) 
(part I - chapters 2-7) 35 
serial 331(*) 
VLSI 579(*) 
VLSI, (chapter) 575(*) 
time, TM relationship to circuit complexity 

5 
work, 

on FSM 96 
on PRAM 290 
computer(s), 

balanced systems 532(*) 
distributed memory 284 285 
distributed shared memory 285 
networked 287(*) 
parallel, 

Brent's principle 291(*) 
Flynn's taxonomy 285 
memoryless 282(*) 
synchronous 285 

unstructured, circuit as form of 283 
with memory 283(*) 
science 3 

shared memory 284 
concatenation 9 

CFL closed under 198 
NFSM 164 
string 158 
concurrency, 
See also, PRAM 
power of 314(*) 
conditional vector operations 286 
configuration, 

graph 218 334 340 
TM, k-tape 218 
connection network 289 
context-free grammar 22 183 
Chomsky normal form 187 
context-sensitive grammar 183 
context-sensitive language(s) 183(*), 183 
Chomsky, 

hierarchy component 5 
context-sensitive language(s) (cont.) 
Chomsky (cont.) 

language type 182 
machine type that corresponds to, (table) 
182 
contradiction, proof by 15(*) 
control, 

CPU, simple CPU 142(*) 
unit 20 
PDA 177 
standard TM 210 
TM 118 
variable 144 474 
controllers 406 
convolution, 
Boolean, 

circuit size 419 

function, circuit size lower bound 422 
as Borodin-Cook lower-bound method, 

application 505(*) 
complexity of fast algorithm 270 
FFT-based algorithm 269 

and 263(*) 
function 268 
I/O time bounds 553 
space-I/O time tradeoffs 552(*) 
systolic arrays and 28 
theorem 268(*) 
wrapped 276 473(*), 474 505 
space-time lower bound 505 
Conway, L. 323 613 615 
Conway, L. A. 601 
Cook, S. A. 72 88 152 323 389 390 497 504 

526 527 528 606 607 608 
Cooley, J. W. 279 608 
Coppersmith, D. 245 278 608 
corollaries, 

area lower bounds, for independent 

functions, (12.8.1) 598 
Boolean convolution function circuit size 

lower bound, (9.6.1) 422 
containment between time-bounded 

complexity classes, (8.5.2) 341 
distinguishable functions, space— time lower 

bound for, (10.11.1) 500 
existence of languages not in P, (8.6.1) 343 
FFT decomposition, (6.7.1) 267 
FSM, minimal-state, characterization of, 
(4.7.1) 174 



corollaries (cont.) 

Grigoriev's lower-bound method, (10.4.1) 

471 
I/O complexity bounds, multi-level, (11.4.1) 

539 
languages, accepted by NDTM accepted by 

DTM, (5.2.1) 216 
matrix multiplication function, vis-a-vis 

transitive closure, (6.4.1) 250 
nondeterministic space classes closed under 

complements, (8.6.2) 346 
Savitch's Theorem, (8.5.1) 340 
separator theorem for trees, (9.2.1) 397 
space-time product lower bound for 

independent functions, (10.4.1) 
471 
time-bounded and space-bounded 

complexity class relationships, 
(8.5.2) 341 
Turing machine time lower bounds, (3.9.1) 
128 
counter, 

incrementing/decrementing 148 
modulo-p 148 
counting, 

binary trees 78 
function 75 
CPU (central processing unit) 19 110 
booting 141 

circuit size and depth 146(*) 
control, simple CPU 142(*) 
simple, 

design 111 137(*) 
instructions 140 
micro-instructions 142 
microcode 142 
registers 138 
timing, simple CPU 142(*) 
CRCW (Concurrent Read/Concurrent 
Write) PRAM 313 
computing Boolean functions on 314 
simulation by EREW PRAM 314 
CREW (Concurrent Read/Exclusive Write) 
PRAM 313 
circuits, 
and 317(*) 

equivalence 376(*), 379 
simulation by 377 
P-complete problems 380 
CREW (Concurrent Read/Exclusive Write) 
PRAM (cont.) 

realization of log-space transformations on 
380 
crossbar network 289 
crossing-sequence argument 596 
cryptography, 

brief history 7 
Csanky, L. 279 609 
Csanky's algorithm 262 

fast matrix inversion with 260 
Csirik, J. 620 
Culler, D. E. 323 609 
cycle(s) 10 

cube-connected, See CCC 
fetch-and-execute 20 
inside of the 590 
cyclic shifting 474(*) 
circuit 49 

efficient branching programs for 496(*) 
functions 48 474 
circuits for 50 

independence properties 474 
reductions between logical shifting 

functions 51 
space-time lower bound 475 
on the hypercube 303(*), 304 
reductions, between logical and 50(*) 
Cypher, R. 323 609 



DAG (directed acyclic graph) 1 

adjacency matrix relationship 248 

circuits 238 

convolution theorem, FFT application 269 

logic circuit as a 16 

maximal path length 249 

space upper bounds 483 
DAM (directed acyclic multigraph) 489 
data-dependent computation, 

branching programs 488 
dataflow computer, 

circuit simulation by 283 
decidable languages 223(*), 225 

standard TM 210 
decimal, 

standard representation 8 
decision, 

binary decision diagram 490 



decision (cont.) 

branching program 489 
problems 328 

classification of 334(*) 

complement of 329 

language complements and 329(*), 330 

regular languages, algorithms 171 
tree 489 

multiway 561 
decoder 53(*) 
function 53 

circuit for 54 
definitions, 
basis, 

measure 397 

of a circuit 38 
big, 

Oh notation, O( ) 13 

Omega notation, Ω( ) 13 

Theta notation, Θ( ) 13 
bilinear form 420 
block I/O model 557 
Boolean, 

convolution function 419 

function class 400 403 

matrix multiplication 422 
branching program 489 

space on 490 
BTM 559 
central slice 435 
Chomsky normal form 187 
circuit 38 

depth 40 394 

depth with fan-out S 394 

family 373 

family, log-space uniform 373 

planar circuit size 586(*) 

size 40 393 

size with fan-out s 393 
communication, 

complexity, of a communication game 
437 

complexity, VLSI 596 
commutative rings 264 
complete problems 351 
complexity, 

class 334 

class, complements 343 
definitions (cont.) 
complexity (cont.) 

communication, of a communication 
game 437 

communication, VLSI 596 
computation on a branching program 489 
configuration, 

graph 218 

k-tape TM 218 
decision problems and their languages 328 
depth, circuit 40 

derivation, phrase-structure grammar 182 
DFSM 154 

equivalence classes 176 
(φ, λ, μ, ν, τ)-distinguishability 497 
DTM 119 

language in P 120 
equivalence, 

classes 172 

relation, DFSM 172 

relation, for a language 172 

relation, refinement 173 

relation, right-invariant 172 

right-invariant, for a language 173 

states 175 
expressions, regular 158 
final state, FSM 92 
formula, size 394 
FSM, 

computational work on 96 

next-state function 92 

output alphabet 92 

output function 92 
functions, 

advice 382 

computed by straight-line programs, 38 

next-state, FSM 92 

pairing 382 

partial recursive 232 

polynomial advice 382 

primitive recursive 231 

proper 330 

reductions between 46 

symmetric 74 
general branching program 490 
goal, communication game 437 
grammar, 

context-free 183 

context-sensitive 183 

phrase-structure 182 

regular 184 



definitions (cont.) 

hard problems 351 
hierarchical memory model 563 
I/O, 

operation 559 

operations, simple 560 

time 559 
immediate derivation, phrase-structure 

grammar 182 
implicants 417 

(α, n, m, p)-independent function 469 
induction hypothesis 15 
initial state, FSM, 92 
input alphabet, FSM 92 
Kronecker product 503 
language, 

CIRCUIT SAT 132 

CIRCUIT VALUE 130 

context-free 183 

context-sensitive 183 

FAN-OUT TWO CIRCUIT SAT language 

150 

in NP 120 

in P 120 

MONOTONE CIRCUIT VALUE 150 

P-complete 130 

phrase-structure 182 

recognition, FSM 92 

regular 158 184 

SATISFIABILITY 132 
matrix, 

multiplication ring operations 245 

nice and ok 501 
monotone, 

communication game 441 

function replacement rule 418 
multigraph 489 

multiplication, smallest circuit 67 
ri-indistinguishable 175 
NC languages 380 
NDTM 120 214 

language in NP 120 
NFSM 154 
non-terminals, phrase-structure grammar 

182 
notation, 

big Oh, O() 13 

big Omega, Ω() 13 

big Theta, Θ() 13 
P and NP, 

complete problems 352 
definitions (cont.) 
P and NP (cont.) 

problems 335 
P/poly languages 383 
permutation 74 
planar circuit size 586(*) 
programs, straight-line 38 
proof by contradiction 15 
protocol, communication game 437 
reducibility 226 
reduction, 

between functions 46 

via subfunction relationship 46 
regular, 

expressions 158 

languages 158 

sets 158 
rings 239(*) 
S-span of a DAG 537 
set, 

neighborhood 408 

of states, FSM 92 

regular 158 
size, circuit 40 
slice functions 431 

space-bounded complexity classes 338 
SPD matrices 253 

start symbol, phrase-structure grammar 182 
straight-line programs 38 

functions computed by 38 
subfunctions 46 

reductions via 46 
superconcentrator 485 
terminals, phrase-structure grammar 182 
time-bounded complexity classes 337 
TM, 

canonical encoding 221

configuration 218 

standard 210 
transformation 348 

and complexity class relationships 350 

classes 350 
transitive, 

closure, phrase-structure grammar 182 

transformations 350 
unique elements 514 
universal 114 
vertex-disjoint 485 
w(u,v)-flow 469 
degree, 
in 10 



degree (cont.) 

out 10 
Dekel, E. 323 609
Demetrovics, J. 620
DeMorgan's rules,

Boolean expressions 41
demultiplexers 5 (*)
dependent variables 399
depth, 

circuit 11 35 40 239 394 436(*)
basis change effect on 396(*) 
bounded 448(*) 

errors with 6-approximator of 448 
formula size vs 396(*) 
in a simple CPU 146(*) 
monotone communication game 

relationship, 441 
relationship between formula size and 397 
simple lower bounds on 400 
with fan-out S 394 
communication complexity, 
relationship 438 (*) 
lower bound, for most Boolean functions 79 
monotone, 

clique function 442 (*) 
communication complexity relationship 
440(*) 
upper bound, for all Boolean functions 80 
derivation 181
immediate 182 
leftmost 186 
parsing 186 
rightmost 186 
descending algorithms 301 306 307
designing, 

circuits 36(*) 
deterministic, 
FSM, See DFSM 
PDA 177 

Turing machine, See DTM 
DFSM (deterministic finite-state machine) 98
154 
See also, FSM; NFSM 
equivalence relation 172 
languages accepted by, same as languages 

accepted by NFSMs 156 
minimal, equivalence relation 177 
NFSM equivalence 156
DFT (discrete Fourier transform) 263 264(*)

as Borodin-Cook lower-bound method, 
application 513(*) 

independence properties 479 

inverse 265 

space-time, 

lower bounds 480 513 
product 479 (*) 

vector-matrix product 513 
diagonalization 225
diagram, 

binary decision 490 

state 18 30 
FSM 21
diameter, 

graph, network 287 
Diaz, J. 389 606
difference, 

languages 170 

sets 7 

symmetric, between sets 234 
diffusion model, 

VLSI 579 
Ding-Zhu, Du 618
directed graph 1 
directed multigraph 489
discrete Fourier transform, 

See, DFT 
disjoint sets 7

(φ, λ, μ, ν, τ)-distinguishability property 497
distinguishability properties, 

flow property relationship to 500 

matrix multiplication 509 

matrix-vector product 508 

(φ, λ, μ, ν, τ) 497

unique elements 515 
distributed, 

computing, brief history 6 

memory computer 284 285 
routing in 309 

shared memory computer 285 
distributive laws 41
distributivity, 

Boolean expressions 42 
divide-and-conquer, 

multiplier, circuit for 67 

strategies, trees 288 
division, 

of integers 68 

reciprocal and 68 
domain, 



of a function 11 
dominant terms, 

as rate of growth indicator 13

big Oh notation, O( ) 13

big Omega notation, Ω( ) 13

big Theta notation, Θ( ) 13
doped layer 576
doping 576

DTM ACCEPTANCE language 354
DTM (deterministic Turing machine),

language acceptance 333 

language in P 120 

multi-tape 333 

P problems 335 

polynomial-time 330 374(*) 

recursive language 333 

simulation of RAM 332 

standard 118 210(*) 
dual-rail logic 84
Duff, I. S. 613

Dunne, P. E. 89 457 458 609
Duris, P. 603 609
dyadic unate basis 392
dynamic programming algorithm 165



Earley, J. 207 609

Eckert 4

Eckstein, D. M. 323 609

edge(s) 10

Edmonds, J. 330 609

efficiency, 

PRAM 290 
eigenvalues 261
eigenvector 261
electronic lock 148
elementary symmetric functions 74
elimination method, 

gates, 

for circuit size 400(*) 
general circuits 400 

Gaussian 274 

paths, monotone circuits, lower bounds 
derivation 413(*) 
embedding, 

1D arrays in 2D meshes 297(*)

arrays in hypercubes 299(*) 

graph, problem 289 
empty, 
set, acceptance problem 229

string 181 
empty (cont.) 

tape acceptance problem 228 
emulation 147(*)
encoder 51(*)

circuit 52 

function, circuit for 52 
encoding, 

canonical, of TM 221 

string, TM and 222(*) 

unary 383 
end-of-tape marker 210
endpoint, 

set 426 

size 426 
ENIAC 4

enumeration tape 215
equivalence, 

class 10 172 

DFSM and NFSM 156(*)

regular expressions 159 

relations 10 172 
DFSM 172 
on languages 171(*)
on states 171(*)
refinement 173 
right-invariant 172 
right-invariant 173 

states 175 
ERCW (Exclusive Read/Concurrent Write) 

PRAM 313
EREW (Exclusive Read/Exclusive Write)
PRAM simulation 313

by hypercube network 317 

CRCW PRAM 314 

of normal algorithm 313 
error function 451
Evey, J. 207 609
EXACT COVER language 360
exclusive access, 

PRAM 312 
existential quantification 365
expansion, 

series, Taylor 73 

sum-of-products 44
exponential, 

functions 13 



exponential (cont.) 

size, bounded-depth parity circuits 448 450 

time, polynomial time compared with 330 
expressions, 

See, regular expressions 
EXPTIME class 337

complexity class relationships 341 
extreme tradeoffs 466(*), 467



face, 

planar graph 590 
factorization, 

prime 87 
Schur 254(*) 
Faddeev, D. K. 279 609
Faddeeva, V. N. 279 609
fan-in, 
circuit 38 
of a basis 392 
trees 394 
fan-out, 
circuit 38 

size impact 394(*) 
reduction 150 

construction used for 215 
fan-out-1 circuit 392 393

relationship to formula size 394 
fast Fourier transform, 

See, FFT 
feasible 381

problems 335 
Feig, E. 573 606

fetch-and-execute cycle 20 110 138 139(*)
FFT (fast Fourier transform),
algorithm 266(*), 267 301 
convolution and 263 (*) 
circuit 266 

convolution application 269 
decomposition 267 
graph, 

butterfly 238 
decomposition 267 548 
pebbling 463 
pebbling of 25 

with column numberings 302 
I/O time bounds, 

in red-blue pebble game 547 
MHG 549 551
FFT (fast Fourier transform) (cont.)

lower performance bounds 565
S-span 546

space-I/O time tradeoffs 546 (*) 
straight-line program for 238 
Fich, F. 528 607
field 274

FIFO (first-in, first-out), 
LRU analysis relative to 568 
page-replacement algorithm 568 
final state92 154 

fine-grained parallel computers 283 284
finite, 

functions 12 
language 9 
finite-state machine, 

See, FSM 
first order linear recurrence 86
Fischer, C. N. 609

Fischer, M. J. 152 456 528 607 609 613 616
flip-flop 109 
floor function 13 
FLOP (floating point operations per 

second) 282
flow properties, 

distinguishability property relationship to 

500 
functions 469 (*) 
matrix multiplication 477 
Flynn, M. J. 323 609
Flynn's taxonomy, 

parallel computers 285 
form, 

bilinear 420 

semi-disjoint 420 

semi-disjoint, replacement rule 420 
formal, 

computational models 4 
languages 4 21 
brief history 5 

Chomsky language hierarchy 5 
formula, 

fan-out-1 circuit 392 393
size 394 

bounds on 397 
circuit depth vs 396(*) 
fan-out-1 relationship 394
lower bounds for 404(*) 
over two different bases 399 
Fortune, S. 323 390 609
Foster, M. J. 601 609



Fourier, 

See, DFT; FFT 
Fraleigh, J. B. 609
Friedman, J. 456 610
FSM (finite-state machine) 92(*)
See also, DFSM; NFSM 
adder 108 

adding two binary numbers 101 
bounded, 

circuits and 96(*) 
brief history 5 
(chapter) 153(*) 
choice input 99 
circuit, 

compared with 94 

computation equivalence 96 

for 23 

simulation of 95 
computational, 

inequalities 95(*), 95 

model 18(*) 

work on 96 
computing exclusive or of its inputs 93 
decision problems, algorithms 171 
deterministic 98 

equivalence of DFSM and NFSM 156(*) 
exclusive or computation 97 
functions computed by 22 94(*), 95 
interconnection 97
language, 

are regular 185 

association with 173 

described by regular expressions 164

recognition by 22 
minimal algorithm for 175(*), 176 
models 154(*) 
nondeterministic 98(*), 154 
PDA control unit as a 177 
pumping lemma for 168(*) 
RAM as 111(*) 
regular expressions, 

recognition by 160(*) 

relationship with 158 
regular language recognition by 184
ripple adder simulation by 107
simulating with shallow circuits 100(*)
state-diagram 21
synchronous 97 

circuit simulating 98 

interconnection of 97(*) 
FSM (finite-state machine) (cont.)

TM, 

control unit as 118

relationship 217 

tape unit as 118
universal RAM for 114 
VLSI chip design use 27 
full adder 59

carry-save adder realization by 64 
circuit 18 
full two-input basis 392
fully normal algorithms 301 306

on 2D arrays 307 
function(s) 10 11
addition 58 60 231 
advice, polynomial 382 
binary 12 39 

tree circuits for 78 
Boolean 12 

algebraic properties of 40(*) 

circuit-size lower bound for most 77 

circuit-size upper bound for all 82 

class 400 403 

class Q™ 401 

complex 77(*) 

computing on CRCW PRAM 314 

depth lower bound for most 79 

depth upper bound for all 80 

(k, s)-Lupanov representation in 81 

logic gate implementation of 16 

maxterm of 43 

minterm of 42 

negations 409(*), 410 411 

sum of 44 
carry-generate 103 
carry-propagate 103 
carry-terminate 103 
ceiling 13 

characteristic 13 375 
circuits that compute 39(*) 
comparator 270 
complete 11 119
composition 231 

computation, by standard TM 210 
computed by, 

circuit 38(*), 392 

DTM 119 

FSM 22 92 94(*), 95 

straight-line program 38 

TM 230(*) 
domain 11



function(s) (cont.) 

error 451 
exponential 13 
finite 12 
floor 13 

implication 410 
linear 13 
logarithmic 13 
monotone 85 392 418 
naming 16 
next-set 154 
next-state 18 

DFSM 154 

DTM 119 

NDTM 120 

RAM 120 

standard TM 210 
output 18 
pairing 382 
partial 11

DTM 119 333 

recursive 231 232 233(*) 

standard TM 210 
polynomial, as real number functions 13
predecessor 232 
prefix 55 

parallel, circuit for 57 
primitive recursive 231(*)
projection 231
proper subtraction 232 
punctured threshold 410 
quadratic 14 
range 11

rate of growth 13(*) 
real number use by 12
realizing subfunction of 47 
reductions between 46(*) 
semi-disjoint 421 
slice 431(*)
space-bounded 342(*) 
successor 231 
superpolynomial 330 
symmetric 74(*) 
total 210 
transition 212 
truth table 12 40 
zero 231
Furst, M. 459 610
Gabarro, J. 389 606

Galil, Z. 389 457 603 609 610 615

game(s), 

communication, 
complexity of 437 
monotone 441 442 
geography 369 
I/O limited 531 
memory-hierarchy pebble 533(*) 

rules 533 
monotone communication, adversarial 

strategy 447 
on graphs, PSPACE-complete problems 

relationship 365 
pebble 24 25 

basic lower bounds method 470 
branching program comparison with 493 
brief history 6 
lower bounds 470(*) 
playing 463 (*) 

red-blue 26 530(*), 532(*), 542 
space-time tradeoff analysis with 461
worst-case tradeoffs 483(*) 
universal vs existential 369 
gap theorem 316 337
Garey, M. R. 389 610
Gaskov, S. B. 89 610
gate(s) 392
circuit 38 
logic 16 
Gaussian elimination 274
GENERALIZED GEOGRAPHY language 370
Gentleman, A. M. 323 610
geography game 369
Gecseg, F. 620
Gibbons, A. 323 610
Gilbert, E. N. 457 610
Gilbert, J. R. 526 610
global routing networks 310(*)
Goldmann, M. 458 610
Goldschlager, L. M. 323 389 390 610
grammar 181

context-free 22 183 

Chomsky normal form 187 
context-sensitive 183 
phrase-structure 182 
regular 153 184 
graphs, 

bipartite 467 



graphs (cont.) 

bisection width, network 287 
butterfly, 

as ascending algorithm 301 
comparator network replacement with 

273 
FFT 238 
network 289 
circuit as 37 
configuration 218 334 
diameter, network 287 
directed, 

adjacency matrix relationship 248 
paths 249 
directed acyclic 1 
circuits 238 
logic circuit as a 16 
embedding problem 289 
FFT 463 
hypercube 288 
inner product 541 
mesh 288 
path in a 1 
pizza pie 600 
pyramid 465 

pebbling 466 
trees 288 
undirected 10 
Greenlaw, R. 389 390 610
grep command 168(*)
Grigoriev, D. Yu. 471 527 610
Grigoriev's lower-bound method 468(*), 470

471 472(*)
Guibas, L. J. 601 602 610



H 

H-tree VLSI chip layout 581(*)

matrix-vector multiplication 582(*) 
prefix computation on 583(*) 

Hagerup, T. 323 606

Haken, A. 457 610

HALF-CLIQUE CENTRAL SLICE, 

function 435 

language 435 
HALT (halt register) 111

simple CPU design spec 138 
halt state, 

DTM 119 

TM, nondeterministic 120 
halting,

problem 227 228 
halting (cont.) 

program 113 
HAMILTONIAN PATH language 387
Harper, L. H. 456 610
Hartmanis, J. 336 389 610 611
Hastad, J. 458 459 610
Hatcher, P. J. 323 611
Heideman, M. T. 279 611
height, parse tree 186
Heintz, C. A. 603 611
Hennessy, J. 5 32 611
Herley, K. T. 323 607
Hewitt, C. E. 526 573 616
hierarchy, 

Chomsky 5 182

memory, 

HMM 562(*) 

tradeoffs, (chapter) 529(*) 

space 336(*) 

time 336(*)
Hillis, W. D. 323 611

history of theoretical computer science 4(*)
HMM (hierarchical memory model) 562(*) 

cost of problems in 565 

lower bounds 564(*) 

upper bounds 567(*) 
Hochschild, P. 602 611
Hockney, R. W. 322 611
Hodes, L. 456 611
Hong, J.-W. 537 573 611
Hong-Kung lower-bound method 537(*)
Hoover, H. J. 72 88 389 390 455 606 610 611
Hopcroft, J. E. 207 236 278 279 389 526 605

611 
Horn clause 385
Hromkovic, J. 603 611
Huffman, D. A. 207 611
hypercube(s) 288

based machines 298(*) 

broadcasting on 303(*) 

cycle shifting on 303(*) 

embedding arrays in 299(*)

fast matrix multiplication on 308(*) 

normal algorithms 301(*)

PRAM simulation 313(*), 315(*) 

sorting algorithm 302 

summing on 302(*) 



I 

I/O (input/output) 26 

block 557 

in the MHG 555(*)
bounded problem 540 
bounds, matrix-vector product 539 
capacity 532 
complexity 563 
bounds 539 
brief history 6 
I/O time bounds 535 
limited, 
game 531 

memory hierarchy game 533 
models, 

block-transfer 559(*) 
RAM-based 559(*) 
operations 26 559 

pebbling strategy 531 
simple 560 
pads, VLSI layout 577 
time 559 
MHG 534 
pebbling strategy 531 
space tradeoffs 24(*), 539(*), 541(*),
546(*), 552(*) 
time bounds 536 
for convolution 553 
for FFT 547 551 
in MHG 544 549 
red-blue pebble game 537 542 
ideal PRAM 312
idempotence, semirings 252
identities, 
matrix 240 

regular expressions 160 
rings 239 
Immerman, N. 389 611
Immerman-Szelepcsényi theorem 344
implicant 417
implication function 410
impossibility theorem 24 95 96
in_wrd 110
in-degree 1 

incrementing/decrementing counter 148
independence properties, 
cyclic shifting functions 474 
DFT 479 

matrix multiplication 470 
wrapped convolution 473 
INDEPENDENT SET language 357
indirect storage access function 404 407
induction 15(*)
inequalities, 

computational 23(*), 95 
for FSM 95(*) 
for interconnected FSMs 97 
for random-access memory 117(*)
for TM 127(*)
RAM 118 
VLSI chips 587(*) 
Markov's 515
initial state 92 214
DFSM 154 
DTM 119 
NDTM 120 
PDA 177 
TM, standard 210 
initialization, 

red-blue pebble game 530 
inner product 241
graphs, pebbling 472 
matrix multiplication 242 
input,
alphabet 92 
choice 99 

NDTM 120 
operation, red-blue pebble game 530 
vertex 10 
INR (input register) 111

simple CPU design spec 138 
insertion sorting network 270
instruction, 

assembly language 112 
direct memory 140 
indirect memory 140 
set, simple CPU 140(*) 
variable 143 
integer(s),

addition function 231
INTEGER PROGRAMMING language 362 
multiplication 475(*) 
algorithm 63 
binary function, space-time lower bound

475 
function 232 

function, space-time lower bound 475 
space-time lower bound 507 
representation 8 58 
integrated circuits 575
interconnection, synchronous FSM 97(*)
interleaved random-access memory 556



interrupt 139 
intersection, 

CFL, not closed under 199 
languages 170 
sets 7 
inversion, 
DFT 265 
matrix 243 252(*) 
algorithm 260 
Borodin-Cook lower-bound method 

application 511(*)
Csanky's algorithm 262 
fast 260(*) 
function, reduction from matrix 

multiplication to 253 
function, triangular matrices 256 
non-singular 243 

reduction to SPD matrix inversion 254 
space-time lower bound 512 
rings 239 

triangular matrices 255(*)
isomorphism, 

DFSM, conditions for 174 
Iverson, K. 323 611



J 

JaJa, J. 279 323 527 602 611
Jesshope, C. R. 322 611
Johnson, D. 455 611
Johnson, D. H. 279 611
Johnson, D. S. 388 389 610 612
Johnson, R. B. 603 612
Jones, N. D. 389 612
jump value, 

for space 483 
Juurlink, B. H. H. 323 612
(k, s)-Lupanov representation 80 82



K 

Karatsuba, A. 88 612

Karchmer, M. 458 612

Karlin, A. R. 323 612

Karp, R. M. 88 152 323 388 389 390 609 612

Kasami, T. 207 612

Kedem, Z. M. 602 612

Khachian, L. G. 353 389 612

Khang, S. M. 601 622

Khasin, L. S. 456 612
Kirkpatrick, D. G. 528 607
Klawe, M. M. 455 528 611 612
Kleene, 

closure 9 158 

CFL closed under 198 
NFSM acceptance of 163 

star 158 
Kleene, S. C. 236 612
Kloss, M. 456 612
Knuth, D. E. 32 279 323 613
Kohavi, Z. 611
Komlos, J. 274 456 457 606
Koutsoupias, E. 456 613
Krapchenko lower bound 407(*)
Krapchenko, V. M. 88 456 613
Krichevskii, R. E. 456 613
Kronecker product, 

nice matrices 503 

three-matrix product in terms of 511
Kumar, V. K. P. 602 611
Kung, H. T. 323 537 573 601 602 606 608
609 610 611 612 613 617 618
Kuroda, S. Y. 207 613



L decision problem 338

Laaser, W. T. 389 612
Ladner, R. E. 152 389 613
Lamagna, E. A. 457 613
Landweber, P. S. 207 613
language(s) 181

2-SAT 363 

3-SAT 356

3-COLORING 359 
ACCEPTANCE 215 

by NDTM and DTM 215 216
DTM 119 333 

LIMITS 223(*) 

NDTM 120 333 
ALTERNATING QUANTIFIED 
SATISFIABILITY, 

PSPACE-complete language 369 
assembly 112 140(*) 

instructions 112 
associated with a decision problem 329 
CFL, Chomsky normal form 187 



language(s) (cont.) 

CIRCUIT SAT 132 355 
CIRCUIT SATISFIABILITY 128 
CIRCUIT VALUE 128 130 131 352 
closed under an operation 170 
complements 170 329(*), 330 
complete 130(*) 
context-free 22 153 183(*) 

closure properties 198(*) 

parsing 186(*) 

PDA acceptance 192(*) 
context-sensitive 182 183(*) 
decidable 223 (*) 

decision problems relationship to 328(*) 
difference between 170 
DTM ACCEPTANCE 354 
efficiently parallelizable 380(*) 
element distinctness 233 
equivalence relations on 171(*)
EXACT COVER 360 
existence of languages not in P 343 
finite 9 
formal 21 181 (*) 

Chomsky language hierarchy 5 
FSM, 

described by regular expressions 164(*) 
GENERALIZED GEOGRAPHY 370 
HALF-CLIQUE CENTRAL SLICE 435 
HAMILTONIAN PATH 387 
INDEPENDENT SET 357 358 
infinite 9 

INTEGER PROGRAMMING 362 
intersection of 170 171 
LINEAR INEQUALITIES 353 
machine 140 

MONOTONE CIRCUIT VALUE 150 353 
NAESAT 353 356 
NC 380

NDTM machine recognition 215 
non-recursively enumerable 224 
NP 26 120

condition for P = NP 130 

relationship to NDTM 26 
NP-complete, 
language(s) (cont.)
NP-complete (cont.) 

brief history 5 

reduction to 132(*) 
NSPACE, recognition by uniform circuit 

family 375 
P 120 

condition for P = NP 130 
P-complete, 

brief history 5 

log-space reduction 131 

reduction to 130(*) 
P/poly 382(*), 383 
phrase-structure 182 182(*), 219(*) 

are recursively enumerable 220 

recursively enumerable languages are 219
programming, brief history of 4 
properties, context-free 197(*) 
QUANTIFIED SATISFIABILITY 365 366 367 
recognition 215 

by FSM 22 92 154

by TM 210

(chapter) 153 

DFSM 154 

NFSM 154 

TM 119
recursive 210 

DTM 333 
recursively enumerable 210 223 224 (*) 

as Chomsky hierarchy component 5 

but not decidable 225(*) 

phrase-structure relationship 219 220 
reducibility 226(*) 
regular 22 153 158 170(*), 184(*) 

as Chomsky hierarchy component 5 

conditions for 174(*) 

conditions for finite and infinite 169 

machine type that corresponds to, (table) 
182 
SATISFIABILITY 132 133 353 356 
strings and 9(*) 
SUBSET SUM 361 
TASK SEQUENCING 361 
undecidable 228 229 230 
unsolvable 223 
verification of 121 
latency 26 284
layout, VLSI 577(*)

LDL^T factorization of SPD matrices 257(*)
Le Blanc, Jr., R. J. 609
Lehman, P. L. 601 614



Leighton, F. T. 323 603 613 614
Leiserson, C. E. 323 601 613
lemmas, 

approximator circuits, 

on negative test inputs, (9.6.6) 427 
on negative test inputs, (9.6.7) 428 
positive test inputs, (9.6.8) 429 
basis change effect on circuit size and depth, 

(9.2.3) 396 
binary trees, 

longest path length, (11.9.1) 565 
longest path length for sorting, (11.8.1)

560 
number of unlabeled, (2.12.2) 78 
Boolean, 

function negations, circuits for, (9.5.1)

410 
matrices, powers of, (6.4.1) 248 
matrix multiplication by monotone 

circuits, (9.6.3) 423 
matrix multiplication on monotone 
circuits, (9.6.4) 423 
branching program, 

pebble-game lower bound from, (10.9.3) 

494 
RAM lower bound from, (10.9.4) 494 
ST lower bound for, by reductions, 
(10.11.2) 500 
circuits, 

for cyclic shifting, (2.5.1) 50 
for demultiplexer function, (2.5.6) 55 
for multiplexer function, (2.5.5) 55 
for next-state/output functions, (3.5.1) 

120 
size bound for indirect storage access 

function, (9.4.1) 405
size, relationship between planar and 
standard, (12.6.1) 586 
class Q$ of Boolean functions, (9.3.1) 401 
clique function, positive test inputs for, 

(9.6.5) 425 
clique lower bound technical lemma, 

(9.7.3) 445 

(9.7.4) 446 

(9.7.5) 446 

communication complexity no more than 

depth, (9.7.1) 438
commutative rings, 

example, (6.7.1) 264 

example, (6.7.2) 264 
lemmas (cont.)

comparator-based merging networks, 

disjoint paths in, (10.5.5) 481 
counting function, circuit for, (2.11.1) 75
CREW PRAM simulation by circuits, 

(8.14.1) 377
cyclic shifting independence properties, 

(10.5.2) 474 
decoder function, circuit for, (2.5.4) 54 
decomposition of trees into subtrees, (9.2.4) 

397 
depth no more than communication 

complexity, (9.7.2) 439 
DFT, 

independence properties, (10.5.4) 479 
vector-matrix product is 1/4-ok,
(10.13.5) 513
distinguishability properties, 

flow property relationship to, (10.11.1) 

500 
wrapped convolution, (10.13.1) 505 
encoder function, circuit for, (2.5.3) 52
errors with, 

fe-approximator of, (9.7.6) 448 
√n-approximator of parity, (9.7.7) 450
fan-out-1 circuits and formula size,
relationship between, (9.2.2) 394 
FFT decomposition, (6.7.4) 267 
functions, 

cyclic shifting, circuit for, (2.5.1) 50 
cyclic shifting, reductions between logical 

and (2.5.2) 51 
realizing subfunction of, (2.4.1) 47 
I/O time bounds, reductions between, 

(11.3.2) 536 
inverse DFT, (6.7.3) 265 
Kronecker product of nice matrices, 

(10.12.2) 503 
lower bounds, indirect storage access 

function, (9.4.2) 407 
matrix, 

multiplication distinguishability 
properties, (10.13.3) 509 
multiplication, flow properties, (10.5.3) 

477 
multiplication, independence properties 

of, (10.4.1) 470
multiplication, S-span for, (11.5.1) 541
nice, (10.12.1) 501
product, inverting, (6.2.1) 243 



lemmas (cont.) 
matrix (cont.) 

vector product distinguishability 
properties, (10.13.2) 508 
maximal path length in DAG, (6.4.2) 249 
monom removal, (9.6.1) 418 
normal-form branching programs equivalent 

to general ones, (10.9.2) 492 
pebbling, 

balanced binary trees, (10.2.1) 465 
pyramid graph, (10.2.2) 466 
pigeon-hole principle, (1.3.2) 16 
planar circuit, size lower bounds for 

independent functions, (12.7.1) 
593 
planar separator theorem, 
conditional, (12.6.2) 590 
multi-set, (12.6.4) 592 
two-cost, (12.6.3) 592 
PRAM and log-space uniform circuit 

relationship, (8.14.2) 378 
proof by induction example, (1.3.1) 15 
pumping 153 

application of, (4.13.2) 198 

CFL, (4.13.1) 197 

finite and infinite regular languages, 

(4.5.2) 169 
regular languages (4.5.1) 169 
QUANTIFIED SATISFIABILITY language, 
log-space hard, (8.12.2) 367 
PSPACE-complete, (8.12.1) 366 
realization of log-space transformations on 

CREW PRAM, (8.14.3) 380 
realizing subfunction of a function, (2.4.1) 

47 
reduction, 

between logical and cyclic shifting 

functions, (2.5.2) 51 
from matrix multiplication to matrix 

inverse, (6.5.1) 253 
of matrix inversion to SPD matrix 

inversion, (6.5.2) 254 
of shifting to multiplication, (2.9.1) 68 
of squaring to reciprocal function, 

(2.10.1) 73
use of, (5.8.1) 227
regular languages, conditions for finite and 

infinite, (4.5.2) 169 
replacement rule semi-disjoint bilinear form, 

(9.6.2) 420 
rooted tree fan-in, properties of, (9.2.1) 394 
lemmas (cont.)

S-span for FFT, (11.5.2) 546

Schur complement of SPD matrix is SPD, 

(6.5.3) 255 
simple I/O time lower bounds, (11.3.1) 535
simulation of decision branching programs 
by general branching programs, 
(10.9.1) 491
slice functions, representation, (9.6.9) 431 
squaring function, (2.9.2) 68 
states, equivalence relation refinement, 

(4.7.1) 175 
superconcentrator, 

linear-size, existence of, (10.8.1) 485 
technical lemma on, (10.8.2) 486 
technical lemma on, (10.8.3) 486 
three-matrix product in terms of Kronecker 

product, (10.13.4) 511 
tree circuit, for binary functions, (2.12.1) 78 
unique elements, 

distinguishability properties, (10.13.7) 

515 
technical lemma, (10.13.6) 514 
unsolvability, (5.8.1) 227 
wrapped convolution, 

distinguishability properties, (10.13.1) 

505 
independence properties, (10.5.1) 473 
Lengauer, T. 482 526 528 601 602 610 614
length, 
path 10 
strings 9 
Leon, S. J. 614
level, 

multigraphs 492 
Leverrier's theorem 261
Levin, L. A. 88 389 614
Lewis, H. R. 236 389 614
Lewis II, P. M. 389 610
lexical analysis 181
lexicographical order 222
Li, Ming 618
Liang, F. M. 602 610
linear, 

arrays 292 293(*), 294 304(*) 

bounded automaton 182 204 

combination, matrix 242 

equation systems 241 242 262(*) 

equations, solutions 263 

functions, as real number functions 13 



linear (cont.) 

independence, matrix 243 
recurrence, first order, of length n 86 
LINEAR INEQUALITIES language, 

inequalities 353 
Lingas, A. 526 614
Lipton, R. J. 601 602 612 614
list, 

adjacency 30 
ranking problem 321
literals, 

positive 385 

SATISFIABILITY language 132 
load balancing 56
local routing networks 309(*)
locality of reference 558
log-space, 

computations 342 

hard for PSPACE, QUANTIFIED 

SATISFIABILITY language 367 
P-complete problems, justification for 352 
programs 129 

PSPACE-complete problems, 
ALTERNATING QUANTIFIED 

SATISFIABILITY language 369
GENERALIZED GEOGRAPHY language

370 
QUANTIFIED SATISFIABILITY language 

367 369 
reduction 131 
TM, composition of 351 
transformations, 

on CREW PRAM, realization of 380 
transitivity of 350 
uniform, 
circuits 373 
circuits 374 
PRAMs 377 
logarithm functions 13 
logic, 

circuits 392 

(chapter) 35(*) 
computational model 16(*) 
computational model, VLSI 579 
dual-rail 84 
gate 16 
mathematical, as foundation for theoretical 

computer science 4 
operations 48(*) 
logical shifting reduction 50(*), 68
LogP model 317(*)
loosely coupled,

computer network 284 
Loui, M. C. 526 528 614
lower triangular, 

matrix 240 
LRU (least recently used) page-replacement 

algorithm 568
Luccio, F. 618
Luk, W. K. 602 614
Lupanov, O. B. 89 614
Lynch, N. A. 528 607



M 

machine(s),

language 140 
programs 141 

with memory, 

See also FSM; PDA; RAM; Turing 

machine 
(chapter) 91(*)
main diagonal, 

matrix 240 
many-to-one reductions 227
mappings 11

state-to-state 101 
MAR (memory address register) 111

simple CPU design spec 138 
marker, 

end-of-tape 210 
Markov's inequality 515
Maruoka, A. 424 457 606
masks 576
mathematical, 

logic, as foundation for theoretical computer 
science 4 

preliminaries 7(*) 
matrix(s) 11 240(*)

addition function 242 

adjacency 11 248

bad 504 

block 243 

Boolean, powers of 248 

characteristic polynomial 260 

circulant 244 

decomposition 246 

good 504 

identity 240 

inversion 252(*) 
algorithm 260 



matrix(s) (cont.) 
inversion (cont.) 

Borodin-Cook lower-bound method 
application 511(*)

Csanky's algorithm 262 

fast 260(*) 

function, reduction from matrix 
multiplication to 253 

function, triangular matrices 256 

reduction to SPD matrix inversion 254 

space-time lower bound 512 
linear combination 243 
lower triangular 240 
main diagonal 240 
multiplication 242 244(*), 477(*), 509 

application to parsing CFL's 190 

Boolean 244 422 

Borodin-Cook lower-bound method 
application 509(*) 

family of inner-product graphs 541 

fast, on a hypercube 308(*) 

flow properties 477 

independence properties of 470 

on a 2D mesh 295(*) 

on a hypercube 308 

on linear arrays 294 

reduction to matrix inversion 253 

reduction to transitive closure 250 

S-span for 541 

size and depth bounds 247 

space-time lower bound 472 479 511 

space-I/O time tradeoffs 541(*)

standard algorithm 422 

Strassen's algorithm 245(*), 247 

three-matrix product space-time lower 
bound 512 
nice 501 

Kronecker product 503 
non-singular, inverse 243 
ok 501 

permutation 244 477 
product, inverting 243 
properties, 

nice 501(*)

ok 501(*)
rank 243 

scalar product 240 
SPD 253(*) 

LDL factorization of 257 

reduction of matrix inversion to 254 

Schur complement is SPD 255 
matrix(s) (cont.)

square 240 

standard matrix multiplication algorithm 

242 
symmetric 240 
Toeplitz 243 
trace 261 

transitive closure 248 
transpose 240 

triangular, inversion of 255(*)
upper triangular 240 
Vandermonde 265 
vector product 241 

Borodin-Cook lower-bound method 

application 507(*) 
DFT 513

distinguishability properties 508 
on a linear array 293(*) 
on an H-tree 582(*) 
space-I/O time tradeoffs 539(*) 
space-time lower bound 508 
zero 240 
Mauchly 4
Maurer, H. A. 618
maxterm 43

monotone 441 
McColl, W. F. 602 614
McCulloch, W. S. 207 615
McNaughton, R. 207 615
MDR (memory data register) 111 

simple CPU design spec 138 
Mead, C. A. 323 601 613 615
Mealy, G. H. 152 207 615
Mealy machine 200

FSM 93
Mehlhorn, K. 323 457 458 601 602 606 614

615 
memory, 
address 111 

MAR, simple CPU design spec 138 
sequence 568 
bounded, RAM 19 111 122 
distributed, 
computer 285 
routing in 309 
shared, computer 285 
fast, simulation in MHG 558(*)
hierarchical models 562(*), 563 
pebble game, See MHG 
tradeoffs, (chapter) 529(*) 
interleaved random-access 556 



memory (cont.) 

locality of reference 558 
machines with, (chapter) 91(*)
management 567 

algorithms, two-level 568(*) 

competitive 567(*) 
number of gates, RISC and CISC CPUs 

compared with 138 
organizations, language relationship to 5 
page-replacement algorithms 567 

FIFO 568 

LRU 568 

MIN 568 
parallel computers with 283(*) 
random-access 114(*) 

circuit 116 
shared, computer 284 
unbounded, RAM 111
units, clocked 106 

virtual memory-management systems 567 
merging, 

bitonic, sorting via 271(*)

block, algorithm for 561 

efficient branching programs for 496(*) 

monotone circuits lower bounds for 414 

networks 270(*), 481(*)

Batcher's bitonic 271 273 

comparator-based 481 

space-time lower bound 482 
problem 270 
mesh(es) 288
2D arrays,

embedding 1D arrays in 297(*)

fully normal algorithms on 306(*) 

matrix multiplication on 295(*), 296 

normal algorithm on 307 

simulation on 1D array 298
multi-dimensional 292 292(*) 

layouts, VLSI chips 583(*) 
row-major order 293 
toroidal 293 
of trees 319 599
message, 
passing 284 
priority 316 
metal migration 577
Meyer, A. R. 389 456 609 619
Meyer auf der Heide, F. 528 607 615
MHG (memory-hierarchy pebble
game) 533(*)
block I/O in 555(*) 
MHG (memory-hierarchy pebble game)
(cont.) 

convolution bounds 553 
fast memory simulation in 558(*) 
I/O time bounds, 
for FFT in 551

for matrix multiplication in 544 
on FFT graph, bounds 549 
playing 534 
rules 533 
Micali, S. 607 610
micro cycle 139 
micro-instructions 139 
simple CPU 142 

affecting registers 145 
microcode, 

execute portion 143 
simple CPU 142 
Miller, G. A. 207 608
Miller, R. E. 612
MIMD (multiple instruction, multiple

data) 285
MIN, 

page-replacement algorithm 568 
minimal, 

DFSM, equivalence relation 177 
FSM, 

algorithm 175(*) 
algorithm for 176 
conditions for 174 
pebbling 534 
strategy 531 
minimization 232
state 171(*)
problem 158 
minimum, 

feature size, VLSI chip wires 578 
space, existence of graph requiring large 488 
minterm 42

monotone 441 
MISD (multiple instruction, single data) 286
models, 

branching program 488(*) 

BSP 317(*)

circuit 372(*), 392(*) 

parallel memoryless computational 282 
computational 16(*) 

branching program comparison 

with 493(*)
logic circuits as 16(*) 
parallel 282(*) 



models (cont.) 

computational (cont.) 

(part I - chapters 2-7) 35 

restricted, representing 217(*) 

serial 331(*)

VLSI, (chapter) 575(*) 
data parallel 286(*), 286 
FSM 154(*) 

hierarchical memory 562(*), 563 
I/O, 

block-transfer 559(*)

RAM-based 559(*)
LogP 317(*)
machine, 

parallel 29 

sequential 5 
MIMD 285 
MISD 286 
nondeterministic 4 
PRAM 32 311 376(*)

as canonical structured parallel machine 

311(*)

RAM 19(*) 

role and types 3 

SIMD 285 

SISD 285 

SPMD, data parallel model implementation 

by 287 
TM, standard 210(*)
VLSI, 

computational 579(*) 
diffusion 579 
physical 578(*) 
synchronous 579 
transmission 579 
transmission-line 579 
modulo-p counter 148
modulus, 

functions, as symmetric function 74 
Monier, L. M.601 608 
monom 417

removing 418 
monotone, 
basis 392 

circuits 27 353 392 
communication game 441

rules 442 
depth, communication complexity 

relationship 440(*) 
functions 85 392 

replacement rules 418
monotone (cont.)

implicant 417 

increasing 392 

maxterm 441

minterm 441 

prime implicant 417 
MONOTONE CIRCUIT VALUE language 150

353 
Moore, E. F. 93 152 207 615
Moore machine,

FSM 93
Muller, D. E. 88 89 455 457 615 617
multi-dimensional meshes 292 292(*)
multi-tape, 

TM 119 
multigraphs 489

level 492 
multilective computation 580
multiplexer 54(*)

function, circuit for 55 
multiplier, 

carry-save, circuit for 66 

divide-and-conquer, circuit for 67 
multiway decision tree 561
Munro, I. 279 607
Myhill, J. 174 207 615
Myhill-Nerode theorem 174(*)



N 

n-indistinguishable states 175 
NAESAT language 356
naming function 16 
Nassimi, D. 323 609
natural numbers 
NC languages 380
NDTM (nondeterministic Turing 
machine) 120(*), 214(*) 

See also, DTM; Turing machine (TM) 

language acceptance 333 
by both DTM and 215 216 

multi-tape 333 

DTM simulation of 216 

NP language 120 
relationship to 26 

NP problems 335 

one-tape 333 

recursive language 333 
Neciporuk, E. I. 456 457 615
Neciporuk lower bound 405(*)



near-ring 86
negations, 

Boolean function 409 410 
negative literal 385
neighborhood, 

set 408 
Nerode, A. 174 207 615 
networks, 

Benes, global routing network example 310 

brief history 6 

CCC 307 

normal algorithms on 308 

comparator 270 271 

computer 284 287(*) 

connection 289 

crossbar 289 

hypercube, PRAM simulation 315(*), 317 

merging 270(*), 481(*)
Batcher's bitonic 271 273 
comparator-based 481 
space-time lower bound 482 

mesh of trees 319 

permutation 310 

routing 309(*) 

sorting 270(*)
Newton approximation algorithm 69
NEXPTIME class 337
next-set function, 

NFSM 154
next-state function 18 92 214 

DFSM 154 

DTM 119 

NDTM 120 

standard TM 210 
NFSM (nondeterministic finite-state 
machine) 98(*), 154

acceptance of 163 164

DFSM equivalence 156 

Kleene closure acceptance by 163 

languages accepted by, same as languages 
accepted by DFSMs 156 

regular expression recognition by 160 161 
162 

regular language acceptance 185 
nice matrices 501

Kronecker product 503 
Nishino, T. 456 606 620
NL problems 338

2-SAT language in 363 

complexity class relationships 341 
no-op 115
Nodine, M. H. 573 615
non-ambiguous languages 187 
non-redundant, 

branching program 490 
non-singular, 

matrix 243 
non-terminal, 

phrase-structure grammar 182

self-embedding 206 

symbols 22 
nondeterministic 98

FSM, See NFSM 

models 4 

PDA 177 

Turing machine, See NDTM 
normal algorithms 301(*)

on 2D array 307 

ascending 306 

AT^2 upper bound for 585

on CCC networks 308 

cyclic shifting on the hypercube 304 

fully 301 306
normal form, 

Boolean function expansions 42(*) 

branching program 492 

comparison of 45(*) 

conjunctive 43 

disjunctive 42 

product-of-sums 44(*) 

ring-sum 45(*) 

standard circuit construction methods 40 

sum-of-products 44(*) 
normalization 263
notation,

big Oh, O() 13

big Omega, Ω() 13

big Theta, Θ() 13

binary relations 9 

computational work done by a FSM 24 

empty, 
set 7 
string 9 

equivalence classes 1 

integer operations 8 

positive closure 9 

product, equivalent number of logic 
operations employed 24 

register transfer 142 

set 7 9 
NP (nondeterministic polynomial time),

complement of 347(*) 



NP (nondeterministic polynomial time) 
(cont.) 

complete, 

problems that are 127 
reducibility used to identify 227 
simulation use to show 23 
complete language 130(*) 
brief history 5 
complexity theory role 128 
reduction to 132(*) 
distinguishing P from, circuit complexity as 

method for 391 
equal to P question, as outstanding 

computer science problem 121 
language 120 

condition for P = NP 130 
relationship to NDTM 26 
P as subset of 121 
problems 335 
NP-complete problems 355(*)
3-COLORING language 359
3-SAT language 356
boundary between P-complete problems and 

363(*) 
CIRCUIT SAT 355 
EXACT COVER language 360 
HALF-CLIQUE CENTRAL SLICE language 

435 
INDEPENDENT SET language 357 358 
INTEGER PROGRAMMING language 362 
justification for 352 
NAESAT language 356 
SATISFIABILITY language 356 
slice functions 435 
SUBSET SUM language 361 
succession of reductions 358 
TASK SEQUENCING language 361 
NPSPACE, 

complexity class relationships 341 
decision problem 338 
language 375 
NSPACE(r(n)) 334

TIME(r(n)) relationship with 341 
NTIME(r(n)) 334

SPACE(r(n)) relationship with 341 
number(s), 
natural 8 
systems 8(*) 
number system, 

complementary 432 433
oblivious,

data 309 
odd-even, 

transposition sort 294 
Oettinger, A. G. 207 615
offline algorithm 567
Ofman, Yu. 88 456 612 616
ok matrices 501
one-dimensional meshes 292
online algorithm 567
OPC (operation code register) 111 

simple CPU design spec 138 
operator, 

associative 48 56 102 
balanced binary tree 48 
oracle, 

function 216 

tape 216 333 
oracle Turing machine, 

See, OTM 
ordering, 

snake row 297 
OTM (oracle Turing machine) 216 333
out_wrd 110
out-degree 1 
output, 

alphabet 92 

function 18 92

next-state/output RAM functions, 
circuits for 120 

operation, red-blue pebble game 530 

vertex 10 
OUTR (output register) 111

simple CPU design spec 138 
overflow, 

addition 61



P = NP problem, 

importance of 336 

outstanding computer science problem 121 
TM complexity vs circuit size complexity as 
tool for resolving 128 
P (polynomial time),

algorithm, CFL recognition 189 
characteristics 5 



P (polynomial time) (cont.) 

complexity class relationships 341 

to each other and to 381 
distinguishing NP from, circuit complexity 

as method for 391 
existence of languages not in 343 
hard problems, LINEAR INEQUALITIES 353 
log-space contained in 342 
problems 130(*), 328(*), 335 
reduction 132 
P-complete problems 120 130 352(*)

boundary between NP-complete problems 

and 363(*) 
brief history 5 
complexity theory role 128 
condition for P = NP 130 
CREW PRAM solutions 380 

DTM ACCEPTANCE 354 
examples of, CIRCUIT VALUE language 128 
justification for 352 
log-space reduction 131
MONOTONE CIRCUIT VALUE 353 
problems that are 127 
reduction to 130(*) 
subset of NP 121 
P/poly languages 383
page, 
fault 567 

replacement algorithms 567 568 
pairing function 382
Pan, V. Y. 607
Papadimitriou, C. H. 152 236 347 389 390

602 614 616 
parallel, 

algorithms, performance 289(*) 
computation 27(*) 
(chapter) 281 
circuit models 372 
models 282(*) 
thesis 379(*)
computers 282 284 
Amdahl's law 290(*) 
asynchronous 285 
Brent's principle 291(*)
Flynn's taxonomy 285 
memoryless 282(*) 
synchronous 285 

unstructured, circuit as form of 283 
with memory 283(*) 
data model 286(*) 
languages, efficiently parallelizable 380(*)
parallel (cont.)

machines, 

P-complete language problem 128 
PRAM 6 27 29 311(*)
prefix, circuits, efficient 57(*)
parallelizable 381
parity, 

bounded-depth parity circuits, 
exponential size 448(*) 
exponential size of 450 
communication problem 438 
function 43 
parsing 22 181
CFL 186(*)

Cocke-Kasami-Younger algorithm 189
parse tree 186

Chomsky normal form 197 
parser 186 
partial, 

computations 210 211 
functions 11 210

DTM computation 333 
NDTM computation 333 
recursive 232 233(*) 
standard TM 210 
TM 119
partition, 

balanced 425 
Paterson, M. S. 89 455 456 457 458 526 573

602 609 614 616 
path(s) 10

directed graph 1 

maximal path length 249 
elimination method, monotone circuits, 
lower bounds derivation 413(*) 
external length 451
length 10 

binary search tree 564 
longest, 

binary tree 565 
for sorting, binary tree 560 
monotone circuits 414 
rich 499 

undirected graph 1
vertex-disjoint, monotone circuits 415
Patterson, D. 323 532 609 611
Paul, W. J. 455 456 526 611 616
Paz, A. 611
PC (program counter) 111 

simple CPU design spec 138 
PDA (pushdown automata) 20 177(*)



CFL acceptance 192 192(*) 
(chapter) 153(*)
PDA (pushdown automata) (cont.) 

computational model 5 20(*) 
languages accepted by, are context-free 194
one-way input tape 178 
stack 178 

state diagram 179 180 
TM relationship 217 
pebble game 24

basic lower bounds method 470 
branching program comparison 

with 488(*), 493
brief history 6 
lower bounds 470(*) 
memory-hierarchy 533(*) 
pebbling, 

balanced binary trees 465 

FFT graph 463 

inner product graphs 472 

minimal 534 

pyramid graph 466 

strategy 531 558 
playing 463(*) 
red-blue 26 

deletion of pebbles 530 

playing 532(*) 

rules and strategies 530(*) 
relationship to red-blue pebble game 530 
rules and strategies 462 
space lower bounds 470(*), 471 
space-time tradeoff analysis with 461 
worst-case tradeoffs 483(*) 
period, 

computation 582 
VLSI chip 580 
Perles, M. 207 606
permutation 74 244
bit reverse 267 
matrix 244 477 
network 310 

Benes, global routing network example 
310 
routing problem 309 
shuffle, on linear arrays 304(*) 
unshuffle, on linear arrays 304(*) 
Peterson, G. L. 456 616
phrase-structure languages 182(*) 
are recursively enumerable 220 
machine type that corresponds to, (table) 
182
phrase-structure languages (cont.)

recursively enumerable languages are 219 

TM and 219(*)
Pietracaprina, A. 323 607
pigeon-hole principle 15(*), 16 168
pipelining 307
Pippenger, N. 152 390 455 456 457 526 527

528 611 616
Pitts, E. 207 615
pizza pie graph 600
planar circuit size 586(*)

lower bound, 

for independent functions 594 
in terms of w(u, w)-flow 593 

relationship between AT^2 and A^2T and 589
planar graph, 

face 590 

triangular 590 
planar separator theorem 589(*), 591

conditional 590 

multi-set 592 

two-cost 592 600 
Plaxton, G. 323 609
pointer doubling 321
polylogarithmic 13 79
polynomial 12

advice function 382 

characteristic, of a matrix 260 

functions, as real number functions 12 

language in NP 120 

language in P 120 

time, See P 
pop, 

PDA 177 

state 180 
ports, 

VLSI layout 577 
POSE (product-of-sums expansion) 44(*)
positive, 

closure 9 158 

instance, monotone communication game 
442 

literal 385 

test inputs 425 

approximator circuits 429 
possible accept state 179 
Post, E. L.236 616 
power, 

set 8 
Pracchi, M. 601 607



PRAM (parallel random-access machine), 

as parallel machine 27 

as synchronous shared memory model 285 

brief history 6 

characteristics of 29 

circuit relationship 378 

CRCW 314

CREW 380 

circuit equivalence 376(*), 379 
circuits and 317(*) 
simulation by circuits 377 
efficiency 290 
EREW, 

CRCW PRAM simulation 314 
simulation by hypercube network 317 
simulation of normal algorithm 313 
hypercube network simulation of 315(*) 
log-space uniform 377 378 
model 376(*) 

as canonical structured parallel machine 
311(*) 
processor-time tradeoff 290 
simulation, of trees, arrays, and hypercubes 

313(*) 
speed 290 
Pratt, V. R. 389 457 617
precise,

TM 334
predecessor function 232
predicate 15 232
prefix, 

circuits, parallel, efficient 57(*) 
computation 55(*), 583(*) 

segmented 56 
function 55 

parallel, circuit for 57 
Preparata, F. P. 88 323 455 457 601 602 603

606 607 614 615 617 622 
Preston, Jr., K. 613
primality problem 347

primality is in intersection of NP and coNP 

348 
test for 347 
prime, 

factorization 87 
implicant 417 
primitive recursive functions 231(*)
priority, 

message 316
PRIORITY model,

PRAM 313 
problems, feasible 335 
complete 129 350(*), 351 
decision, 

classification issues 328(*), 334(*) 

complement of 329 

language complements and 329(*), 330 

regular languages, algorithms 171 
hard 350(*), 351 
NL 338

2-SAT language in 363 

complexity class relationships 341 
NP-complete 352 355(*) 

3-COLORING language 359

additional examples 357(*) 

boundary between P-complete problems 
and 363(*) 

EXACT COVER language 360 

INDEPENDENT SET language graph 358 

INTEGER PROGRAMMING language 362 

justification for 352 

SATISFIABILITY language 356 

SUBSET SUM language 361 

succession of reductions 358 

TASK SEQUENCING language 361

P = NP 5

P-complete 352 352(*) 

boundary between NP-complete problems 
and 363(*) 

DTM ACCEPTANCE 354 

justification for 352 

MONOTONE CIRCUIT VALUE 353 
P-hard, LINEAR INEQUALITIES 353 
PSPACE-complete 365(*)

ALTERNATING QUANTIFIED 

SATISFIABILITY language 369 

GENERALIZED GEOGRAPHY language 
370 

QUANTIFIED SATISFIABILITY language 

365 366 367 369 
state minimization 158 
TSP, NP-complete association with 5 
unsolvable 227(*) 
product, 
Cartesian 8 
Kronecker 503 
matrix-vector 507(*), 508 
variables 44 
vector-matrix, DFT 513 



program(s),

boot 141 
branching 488(*) 

comparison with other computational 
models 493(*) 

straight-line programs vs. 496(*) 
correctness 5 
halting 113 
machine language 141 
RAM 112(*) 
recursion 375 
straight-line 17 35 238(*) 

branching programs vs. 496(*) 

circuits and 36(*) 
tree 491
programming, 

dynamic, algorithm 165 
projection function 231
proof, 

by contradiction 15(*) 
by induction 15(*) 
methods of 15(*) 
propagation, 

carry 59 
proper, 

integer subtraction function 232 
subset 7 
subtraction 232 
properties, 

algebraic, of Boolean functions 40(*) 
CFL 197(*) 

closure 198(*) 

non-closure 199 
closure, regular languages 170 
distinguishability, 

(φ, λ, μ, ν, τ) 497

flow property relationship to 500 

matrix multiplication 509 

matrix-vector product 508 

unique elements 515 
flow 469 

distinguishability property relationship to 
500 

functions 469(*)

matrix multiplication 477 
independence, 

cyclic shifting functions 474 

DFT 479 

matrix multiplication 470 

wrapped convolution 473
properties (cont.)

matrices,
nice 501(*)
ok 501(*)
regular,

expressions 159 
languages 170(*) 
rooted tree fan-in 394 
of semirings 251 
sets 8 
trees 397 
protocol, 

communication game 437 
pseudo-negations 432

realization by monotone circuits 432
PSPACE decision problem 338
PSPACE-complete problems 365(*)
ALTERNATING QUANTIFIED 

SATISFIABILITY language 369 
GENERALIZED GEOGRAPHY language 370 
QUANTIFIED SATISFIABILITY language 365 
tree circuit 366 
Pucci, G. 323 607
Pudlak, P. 456 617
pumping lemma 53 
application of 198 
CFL 197(*) 
FSM 168(*) 

regular languages, conditions for finite and 
infinite 169 
punctured threshold function 41
pyramid graph 465
pebbling 466 



quadratic function 14
quantification, 

existential 365 
universal 365 
QUANTIFIED SATISFIABILITY language 365

367 369 

tree circuit 366 
quasiplanar 577
query, 

superfluous 499 
Quinn, M. J. 323 611 617



Rabin, M. O. 152 207 617

radius of a rooted spanning tree 589

radix sort 286

RAM (random-access machine), 

architecture 110 (*) 

as serial computational model 331(*)

based I/O models 559(*) 

bounded-memory 111 

branching program simulation 495 

circuits, next-state/output functions 120 

computational inequalities for 117(*), 118 

computational models 19(*) 

FSM 111(*)

memory hierarchy simulations, speed and 
size tradeoffs, (chapter) 529(*) 

programs 112(*), 113 

simulation 122 332 

space 332 
use 495 

time 332 

TM relationship to 124 

unbounded-memory 111 

universal 114(*) 
Ramachandran, V. 323 388 390 612
Ranade, A. 323 617
Randell, B. 32 617
random-access memory 19 114(*) 

architectural components 110 

circuit 116 

design 115(*) 

interleaved 556 
range, 

of a function 11 
rank, 

matrix 243 
RASP (random-access stored program 

machine) 114 
rate of growth, 

functions 13(*) 
Raz, R. 458 617
Razborov, A. A. 457 459 617
reachability, 

algorithm, 

paths explored by 344 

reachable vertex counting program 345 

problem 338 
read 115

once computation 580
real numbers,

functions using 12 
reciprocal, 

algorithm 70 
division and 68 
function, 

circuit for 72 

reduction of squaring to 73 
integer 68 

reduction from 72(*) 
Reckhow, R. A. 323 389 608
recognition,
language 215
by FSM 154
(chapter) 153 
DFSM 154 
NFSM 154 
TM 119 
regular, 

expressions, by FSM 160(*), 161 
languages 184(*) 
languages, by FSM 185
records, 

activation 339 
complete 339 
rectilinearity, 

VLSI wire layouts 578 
recurrence, 

first-order linear, of length n 86 
recursion, 

decomposition, of set of strings 166 
enumerable language, as Chomsky hierarchy 

component 5 
language, DTM 333 
partial recursive functions 231 232(*) 

RAM computability of 233(*) 
primitive recursive functions 231(*)
standard TM 210
recursively enumerable languages 223
are phrase-structure 219 
but not decidable 226 228 
Chomsky hierarchy component 5 
decidable 225(*) 

phrase-structure languages are 220 
standard TM 210 
red pebble game, 

See, pebble game 
Red'kin, N. P. 88 455 457 617
red-blue pebble game, 
See also, pebble game 



red-blue pebble game (cont.) 

I/O time bounds for matrix multiplication 

in 542 
on FFT graph, computation and I/O time 

lower bounds 547 
playing 532(*) 
rules and strategies 530(*) 
reducibility 226(*)

classifying languages as unsolvable using 227 
unsolvability and 226(*) 
reduction 348(*)

between logical and cyclic shifting functions 

51 
CIRCUIT SAT language to NAESAT language

357 
from Turing to circuit computations 128(*) 
function 46(*) 
I/O time bounds 536 
integer reciprocal 72(*) 
log-space 131 

logical and cyclic shifting 50(*) 
many-to-one 227 348 
multiplication 68(*) 
NP-complete languages 132(*) 
P-complete languages 130(*) 
polynomial time 132 
problem-solving method 35 
of squaring to reciprocal function, reduction 

of squaring to 73 
subfunction relationship 46 
to complete problems 129 
Turing 348 385 
refinement, 

equivalence relation 173 
on states 175 
reflexive, 

relation 10 
register(s) 109 

pebble game relationship to 6 
set 138(*) 
simple CPU 138 
transfer notation 142 
regular, 

expressions 158(*) 
equivalence of 159 
FSM and 160(*)

FSM languages described by 164(*) 
NFSM recognition of 160 161 162 163 
properties of 159
recognition by FSM 160(*) 
string search use 168(*)
regular (cont.)

grammar 184 
languages 22 158 184(*) 

as Chomsky hierarchy component 5 

as Chomsky language type 182 

closure properties 170 

conditions for 174(*) 

conditions for finite and infinite 169 

decision problems on, algorithms 171 

machine type that corresponds to, (table) 

182 
properties of 170(*) 
pumping lemma 169 
recognition 184(*) 
regular language acceptance 185 
machine recognition problem 229 
set 158 
Reif, J. H. 72 88 323 610 617 618 619
Reischuk, R. 526 618
reject state 179
relations 9(*)

equivalence 10 172 
DFSM 172 
for a language 172
on languages 171(*)
on states 171(*)
right-invariant 172 173 
reflexive 10 
symmetric 10 
transitive 10 
Rem, M. 601 615
replacement, 

function replacement method, monotone 
circuit lower bounds 
derivation 417(*)
rules 417 

monotone functions 418 
semi-disjoint bilinear form 420 
representation,
integers 8

(k, s)-Lupanov 80 81 82
restricted models of computation 217(*) 
standard, 
binary 8 
decimal 8 
reset, flip-flop 109 
resource, 

bounds 330(*) 

transformations 348 



resource (cont.) 

vector 534 
rewriting strings 181
Rice, H. G. 236 618
Rice's Theorem 229 230
rich path 499

right-invariant equivalence relation 172 173
rings 239(*)

commutative 264(*) 

linear arrays 292 

matrix multiplication 242 245 

near 86 

semirings 251 
Riordan, J. 89 618
ripple adder 58 107

RISC (reduced instruction set computer) 138 
root(s), 

of unity, in commutative rings 264 

rooted directed acyclic multigraph 489 

vertex 489 
Rosenberg, A. L. 603 618
routing 309

networks 309(*), 310(*)

permutation problem 309 
row-major order, 

meshes 293 
RSE (ring-sum expansion) 45(*)
Ruane, L. M. 602 613
rules, 

absorption, in Boolean expressions 41 

DeMorgan's, in Boolean expressions 41 

replacement 417 

semi-disjoint bilinear form 420 
Ruzzo, W. L. 389 390 610



S-span,

DAG 537 

matrix multiplication 541 
safe, 

circuit 107 
Sahay, A. 323 609
Sahni, S. 323 609
Santos, E. E. 323 609

SATISFIABILITY language 132 133 328 356
satisfiable 328
Savage, C. 602 618
Savage, J. E. 89 152 389 455 456 457 526 527
528 573 602 603 608 610 611 613
618 619
Savitch, W. J. 323 389 618
Savitch's theorem 339 340
Saxe, J. 459 610
CIRCUIT SAT language 355
scalar product, 

matrix 240 
Schafer, T. J. 389 618
Schauser, K. E. 323 609
Schmidt, E. M. 458 615
Schmidt, H. A. 611
Schnorr, C. P. 152 455 619
Schonhage, A. 67 88 619
Schonhage-Strassen circuit 67
Schur,

complement 254

factorization 254(*)
Schürfeld, U. 619
Schutte, K. 611

Schutzenberger, M. P. 207 619
Scott, D. 152 207 617
search, 

binary 565 
tree 564 
Sedgewick, R.601 602 614 
self-terminating machine problem 230
semantics, 

programming language, brief history 5 
semellective computation 580
semi-disjoint bilinear form, 

replacement rule 420 
semi-disjoint function, 

circuit size lower bound 421 
semigroup 56
semirings 251
separator theorem, 

for trees 397 

planar 589(*), 591 
conditional 590 
multi-set 592 
two-cost 592 600 
sequences, 

bitonic 278 
sequential, 

circuits 106 

as concrete implementation of sequential 

machine model 5 
constructing from a FSM 92 



sequential (cont.) 
circuits (cont.) 

designing 106(*) 
machine, sequential circuit as concrete 
implementation of 5 
serial, 

computation thesis 330 
computational models 331(*)
branching program 488(*) 
space, parallel time relationship to 379 
series, 

expansion, Taylor 73 
set(s) 7

binary relation over 9 
cardinality 7 
characteristics of 7(*) 
difference 7 
disjoint 7 
final states, 
DFSM 154 
PDA 177 
flip-flop 109 

instruction, simple CPU 140(*) 
intersection 7 
matrix over 240 
membership notation 7 
neighborhood 408 
power 8 
properties 8 
regular 158 

strings, concatenation 9 
symmetric difference 234 
totally ordered 270 
union 7 
Sethi, R. 527 605 608
shallow, 
circuits, 

simulating addition with 105(*) 
simulating FSM with 100(*) 
Shamir, E. 207 606
Shannon, C. E. 88 89 618 619

contributions to theoretical computer 
science 4 
shared memory computer 284
Shepherdson, J. C. 389 619
shifting, 

circuits, cyclic 49 
cyclic 474(*) 
function 474 

functions, independence properties 474 
functions, space-time lower bound 475
shifting (cont.)
cyclic (cont.) 

on the hypercube 303(*), 304 
reductions between logical shifting 
and 50(*)
functions 48(*) 
cyclic 48 

cyclic, circuits for 50 
logical, 

reduction to multiplication 68 
reductions between cyclic shifting 
and 50(*)
Shriver, E. A. M. 573 621
shuffle permutations 304
on linear arrays 304(*)
Siegel, A. 603 619
signed two's complement 61
SIMD (single instruction, multiple data)

model 285
simulation 

branching program 491
circuit, 

by dataflow computers 283 
of FSM 95 

of TM 124(*), 134(*)
CPU by another CPU 147(*)
CRCW PRAM, by EREW PRAM 314 
CREW PRAM, by circuits 377 
FSM, by shallow circuits 102 104 
of 2D array on 1D array 298
of fast memory in the MHG 558(*)
of normal algorithm, PRAM EREW 313 
PRAM, 

by hypercube network 315(*) 
of trees, arrays, and hypercubes 313(*) 
by precise TM 334 
RAM, 

branching program 495 
by DTM 332 
by TM 122
TM, single-tape simulation of multi-tape 
213 
sink vertex 489

Sipser, M. 89 456 459 602 607 610 616
SISD (single instruction, single data)

model 285
size, 

circuit 11 35 239 

as quantity whose rate of growth is 

significant 13 
basis change effect on 396 



size (cont.) 
circuit (cont.) 

bounds on 402 
fan-out impact on 394(*) 
gate-elimination method for 400(*) 
in a simple CPU 146(*) 
monotone, clique function 430 
planar 586(*) 

simple lower bounds on 400 
slice function relationship 432 
upper bounds on 79(*) 
with fan-out S 393 
exponential, bounded-depth parity circuits 

450 
formula 394 
bounds on 397 
circuit depth vs 396(*) 
fan-out-1 relationship 394 
lower bounds for 404(*) 
over two different bases 399 
monotone circuits, slice functions 434 
planar circuits, relationship between AT^2

and A^2T and 589
polynomial, circuits of 382 
speed tradeoffs, 
(chapter) 461(*)
in memory hierarchies 529(*)
Skyum, S. 457 619
Sleator, D. D. 573 619
slice functions, 
central slice 435 
circuit size relationship 432 
HALF-CLIQUE CENTRAL SLICE, 
function 435 
language 435 
monotone circuits 431(*)
NP-complete 435
representation 431
sliding, 

red-blue pebble game 462 530 
Smith, C. H. 152 619
Smolensky, R. 459 619
snake row ordering 297 316
Snir, M. 563 573 605
solvable task 21
solving, 

linear systems 262(*) 
Song, S. W. 602 613

SOPE (sum-of-products expansion) 44(*)
sorting, 

algorithm 301 302
sorting (cont.)

binary 85 

functions as symmetric function 74 

monotone circuits lower bounds 413 
bitonic 271 272 278
as Borodin-Cook lower-bound method, 

application 516(*) 
BTM 561
bubble sort 294 
comparison-based, lower performance 

bounds 565 
linear arrays 294(*) 

longest path length, for binary tree 560 
networks 270(*) 

AKS 274 

fast 274(*) 

insertion 270 
odd-even transposition 294 
problem 270 
radix sort 286 

space-time lower bounds 516 
stable sorting algorithm 304 
via bitonic merging 271(*)
space, 
bounded, 

complexity classes 338(*) 

complexity classes, time-bounded 

complexity class relationships with 

341(*)

functions 342(*) 
branching program 490 
deterministic, nondeterministic time 

contained in 341 
hierarchy 336(*) 
I/O time tradeoffs 539(*) 

convolution 552(*) 

FFT 546(*)

matrix-matrix multiplication 541(*)

vector-matrix product 539(*) 
jump value for 483 
log-space, 

contained in polynomial-time 342 

reduction 131 
lower bounds 465(*)

pebble game 470(*) 
MHG 534 
minimum 465 

existence of graph requiring large 488 
nondeterministic space classes closed under 

complements 346 
OTM 217 333



space (cont.) 

pebbling strategy 531 

quantity whose rate of growth is significant 

13 
RAM 332 495 

serial, parallel time relationship to 379 
time, 

and I/O tradeoffs 24(*) 

bounds on MHG 544 

lower bound, cyclic shifting functions 475 

lower bound, DFT 480 513 

lower bound, integer multiplication 507 

lower bound, matrix inversion 512 

lower bound, matrix multiplication 511 

lower bound, matrix-vector product 508 

lower bound, merging networks 482 

lower bound, sorting 516 

lower bound, unique elements 516 

lower bound, wrapped convolution 505 

product for branching programs 500 

product, matrix multiplication 472(*) 

product (ST) 118 

tradeoffs, (chapter) 461(*)

tradeoffs in memory hierarchies, (chapter) 

529(*) 
tradeoffs, matrix multiplication 479 
tradeoffs, pebble game study of, brief 
history 6 
TM 333

upper bounds 483(*)
SPACE(r(n)) 334

NTIME(r(n)) relationship with 341
space hierarchy 337
spanning tree 589

BFS 591 
SPD (symmetric positive definite) 
matrices 253(*), 253
inversion, reduction of matrix inversion to

254
LDL^T factorization of 257(*), 257 258 259
Schur complement of SPD matrix is SPD 
255 
Specker, E. 456 611
speedup,

PRAM 290
Spira, P. M. 455 619
Spirakis, P. 323 607 610
Spivak, M. 72 619
SPMD (single program multiple data)
model, 

data parallel model implementation by 287 
Sproull, B. 606 612 613 617 618
square, 

matrix 240 
squaring function 68
ST (space-time product) 118
stable sorting algorithm 304
stack, 

alphabet, PDA 177 
PDA 177 
stacking state 178 
standard, 
basis 373 392 

of a logic circuit 38 
representation, 
binary 8 
decimal 8 
start, 

symbols 22 
state(s) 30
accept 179 

assignment, problem 107 
branching program 495 
diagram 18 30 

FSM 21
equivalence 175
relations on 171(*)
relations refinement on 175 
final, DFSM 154 
initial, 

DTM 119 
NDTM 120 
minimization 171(*)

problem 158
n-indistinguishable 175
next, 

DTM 119 
NDTM 120 
next-state/output RAM functions, circuits 

for 120 
possible accept 179
reject 179 
set of, 

DFSM 154 
DTM 119 
NDTM 120 
stacking 178 
to-state mappings 101 
Stearns, R. E. 336 389 610 611



Steele, G. 323 606 611 612 613 617 618
step,

basis 15
Stewart, G. W. 613
Stimson, M. J. 323 618
Stockmeyer, L. J. 389 390 619
Stone, H. S. 323 619
storage, 

access function 54 

capacity 111 
TM 119 
stored-program concept 110
straight-line program(s) 17 35 238(*)

algorithms, lower performance bounds 565 

Boolean, circuit as graph of 37 

branching programs vs. 496(*) 

circuits, 
and 36(*) 
representation of 37 238 

functions computed by 38 

realizing subfunction of a function 47 
Strassen, V. 67 88 245 278 618 619
Strassen's algorithm 245(*)

matrix multiplication 247 
strategy, 

adversarial 443 445 447 

pebbling 531 558 
strict refinement 173 
string(s) 9

acceptance 92
by FSM 154
DFSM 154 
DTM 119 
NFSM 154 

choice input, acceptance by NDTM 120 

concatenation 158 

empty 9 

encoding of, TM and 222(*) 

languages and 9(*) 

relation to alphabets 9 

searching for, with grep 168(*) 

sets of, concatenation 9 
Sturgis, H. E. 389 619
Subbotovskaya, B. A. 456 619
subfunctions,

realizing, of a function 47

relationship, reduction via 46
Subramonian, R. 323 609
subset(s) 7
SUBSET SUM language 361
substitution,

backward 263 

constants, in Boolean expressions 41 
subtraction 61(*)

function, proper 232
successor function 231
succinct,

certificate 100
sum 44

Boolean function 44
summing, 

on the hypercube 302(*) 

operations 48 
superconcentrator 485 486
superfluous query 499
superpolynomial function 330
Swamy, S. 526 527 618 619
symbol(s), 

non-terminal 22 

start 22 

terminal 22 
symmetric, 

difference, between sets 234 

elementary, functions as symmetric function 
74 

functions 74(*) 
circuits for 76 

matrix 240 

positive definite matrices, See SPD 

relation 10 
synchronous, 

FSM 97

circuit simulating 98 

model, VLSI 579 

parallel computers 285 
systems, 

balanced computer 532(*) 

number 8(*) 
systolic array 27 28 292
Szelepscenyi, R. 389 620
Szemeredi, E. 274 456 457 606



table lookup 493
Tanaka, K. 456 606 620
tape, 

alphabet 214 
DTM 119 
NDTM 120 



tape (cont.) 

alphabet (cont.) 

PDA 177 
standard TM 210 
empty, acceptance problem 228 
enumeration 215 
head 20 
multi, TM 119 
one, TM 118 
oracle 216 333 
PDA, blank symbol 177 
TM 20 118 149 210 213(*)
Tardos, E. 457 620
Tarjan, R. E. 482 526 528 573 602 610 614 616 619
TASK SEQUENCING language 361
Tate, S. R. 72 88 618 620
taxonomy, 

Flynn's, parallel computers 285 
Taylor series expansion 72 73
terminal(s),

phrase-structure grammar 182 
symbols 22 
termination, 

abnormal 210 
terms, 

dominant, 

as rate of growth indicator 13
big Oh notation 13
big Omega notation 13 
big Theta notation 13 
test cases, 

approximation method 425 
Thatcher, J. 612
theorems, 

2-SAT language in NL, (8.11.1) 363 
(2.6.1), theorem 3.2.1 as restatement of 102 
3-COLORING is NP-complete, 

(8.10.4) 359
3-SAT is NP-complete, (8.10.1) 356
addition function, circuit for (2.7.1) 60 
algorithms, 

Csanky's, for matrix inversion, (6.5.6) 262 
decision problems on regular languages, 

(4.6.2) 171 
fast, convolution complexity of, (6.7.3) 

270 
FFT, (6.7.1) 267

FFT-based for convolution, (6.7.2) 269 
fully normal ascending/descending, 
(7.7.4) 306 
theorems (cont.)
algorithms (cont.) 

hypercube sorting, (7.7.1) 302 
LDL^T factorization of SPD matrices,

(6.5.3) 259 
matrix inversion, (6.5.4) 260 
minimal FSM, (4.7.2) 176 
normal, EREW PRAM simulation, 

(7.9.1) 313
normal, on 2D array, (7.7.5) 307 
normal, on CCC network, (7.7.6) 308 
polynomial time, CFL recognition, 

(4.11.2) 189 

Strassen's matrix multiplication, (6.3.1) 
247 
ALTERNATING QUANTIFIED 

SATISFIABILITY language, 
log-space complete for PSPACE, (8.12.2) 
369 
Amdahl's law, (7.4.2) 290 
area lower bound, for matrix multiplication, 

(12.8.3) 598 

AT^2 and A^2T lower bounds and

communication complexity,

(12.7.4) 596

AT^2 and A^2T lower bounds for

independent functions, (12.7.2)

593
AT^2 and A^2T lower bounds for matrix

multiplication, (12.7.3) 595
AT upper bound for normal algorithms, 

(12.5.1) 585 
basic pebble-game lower bound method, 

(10.4.1) 470 

Batcher's bitonic merging network, (6.8.2) 
273 
complexity of, (6.8.3) 273 

binomial 451 

Boolean, 

convolution circuit size, (9.6.3) 419 
functions, circuit-size upper bound for all, 

(2.13.2) 82 

functions, depth lower bound for most, 

(2.12.2) 79
functions, depth upper bound for all, 

(2.13.1) 80
functions, negations needed to realize, 

(9.5.1) 409
matrix multiplication optimal monotone 

circuit, (9.6.5) 424 



theorems (cont.) 

bounded-depth parity circuits, have 

exponential size, (9.7.4) 450 
bounded-fan-out circuits, (9.2.1) 395 
branching program, basic, lower bound 

method, (10.11.1) 498
Brent's principle, (7.4.3) 291 
broadcasting on the hypercube, (7.7.2) 303 
BTM sorting time, (11.8.1) 561
bubble sort, on linear array, (7.5.2) 294 
carry lookahead adder, circuit for (2.7.2) 61 
carry-save adder, circuit for (2.9.1) 64 
Cayley-Hamilton 260 
CFL, 

Chomsky normal form for, (4.11.1) 187
closure properties, (4.13.1) 198 
non-closure properties, (4.13.2) 199 
PDA acceptance of, (4.12.1) 192 
recognition, polynomial time algorithm, 
(4.11.2) 189 
chip area, 

lower bound in terms of w(u,v)-flow,

(12.8.1) 597
lower bounds for independent functions, 
(12.8.2) 597 
Chomsky normal form, for CFLs, (4.11.1) 

187 
circuit(s), 

Boolean convolution, size, (9.6.3) 419 
bounded-depth parity, have exponential 

size, (9.7.4) 450 
bounded-fan-out, (9.2.1) 395 
complexity classes, containment of, 

(8.15.1) 381
complexity classes, P/poly, (8.15.2) 383 
computations, equivalence between FSM 

and, (3.1.2) 96
CREW PRAM equivalence, (8.14.1) 379 
depth, relationship between formula size 

and, (9.2.2) 397 
for addition function, (2.7.1) 60 
for carry lookahead adder, (2.7.2) 61
for carry-save adder, (2.9.1) 64 
for carry-save multiplier, (2.9.2) 66 
for divide-and-conquer multiplier, (2.9.3) 

67 
for parallel prefix function, (2.6.1) 57 
for reciprocal function, (2.10.1) 72 
for symmetric functions, (2.11.1) 76 
for transitive closure, (6.4.3) 251 
theorems (cont.)
circuit(s) (cont.) 

log-space uniform, polytime DTM 

functions are computable by, 

(8.13.1) 374
MONOTONE CIRCUIT VALUE language,

is P-complete, (8.9.1) 353 
monotone, lower bound for binary 

sorting, (9.6.1) 413
monotone, lower bound for merging, 

(9.6.2) 414
monotone optimal, Boolean matrix 

multiplication, (9.6.5) 424 
monotone, realization of 

pseudo-negations by, (9.6.8) 432 
monotone, size of clique function, (9.6.6) 

430 
planar, relationship between size and AT^2

and A^2T, (12.6.1) 589
planar, size lower bound in terms of

w(u,v)-flow, (12.7.1) 593
semi-disjoint function, size lower bound, 

(9.6.4) 421 

shallow circuit simulation of FSM, (3.2.1) 

102 
shallow circuit simulation of FSM, (3.2.2) 

104 
simulation of TM (3.9.1) 125 
size, and depth, simple lower bounds on 

(9.3.1) 400
size, lower bound for most Boolean 

functions, (2.12.1) 77 
size lower bounds for functions in Fa 

(9.3.3) 403 
size lower bounds for functions in Qj 3 . 

(9.3.2) 401 
size, of function and its slices, (9.6.7) 432 
size, slice functions have comparable 

monotone and non-monotone,

(9.6.9) 434 
size, upper bound for all Boolean 

functions, (2.13.2) 82 
uniform circuit family, NSPACE language 

recognition by, (8.13.3) 375 
CIRCUIT SAT language, (3.9.6) 132 
CIRCUIT VALUE language, P-complete 

(3.9.5) 131 
communication complexity, 

AT^2 and A^2T lower bounds and,

(12.7.4) 596 
equals depth, (9.7.1) 438



? (n,k) 
theorems (cont.)

communication complexity (cont.) 

monotone, equals depth of monotone 

functions, (9.7.2) 441
of clique function, (9.7.3) 447 
competitive analysis of FIFO and LRU, 

(11.10.1) 568 
complexity, 

Batcher's bitonic merging network, (6.8.3) 

273 
classes, circuit, containment of, (8.15.1) 

381 
classes, circuit, P/poly, (8.15.2) 383 
classes, containment among 

time-bounded, (8.5.4) 337 
classes, time and space-bounded, 

relationship between, (8.5.6) 341 
communication, AT^2 and A^2T lower

bounds and, (12.7.4) 596 
communication, equals depth, (9.7.1) 438 
communication, of clique function, 

(9.7.3) 447 
convolution, of fast algorithm, (6.7.3) 270 
monotone communication, equals depth 

of monotone functions, (9.7.2) 441 
of transitive closure function, (6.4.1) 249 
computing Boolean functions on CRCW 

PRAM, (7.9.2) 314 
containment among time-bounded 

complexity classes, (8.5.4) 337 
containment of, 

circuit complexity classes, (8.15.1) 381 
deterministic classes and their 

complements, (8.6.1) 343 
convolution 268(*) 

Boolean, circuit size, (9.6.3) 419 
complexity of fast algorithm, (6.7.3) 270 
FFT-based algorithm, (6.7.2) 269 
in I/O-limited MHG, I/O time bounds, 

(11.5.8) 553
in MHG, I/O time bounds (11.5.7) 553 
wrapped, space-time lower bound, 

(10.13.1) 505
wrapped, space-time lower bound, 

(10.5.1) 474
CREW PRAM and circuit equivalence, 

(8.14.1) 379
cyclic, 

shifting on hypercubes, (7.7.3) 304 
shifting space-time lower bound, (10.5.2)

475 
decidable languages,

(5.7.1) 224 

(5.7.2) 224 

(5.7.3) 224 

complement is decidable, (5.7.5) 225 
DFT, 

space-time lower bound, (10.13.7) 513 
space-time lower bound, (10.5.5) 480 
DTM ACCEPTANCE is P-complete, (8.9.3)

354 
empty, 

set acceptance problem, (5.8.3) 229 
tape acceptance problem, (5.8.2) 228 
EREW PRAM simulation, 

by hypercube network, (7.9.4) 317 
of CRCW PRAM, (7.9.3) 314
of normal algorithm, (7.9.1) 313 
EXACT COVER is NP-complete, (8.10.5) 360
existence of graph requiring large minimum 

space, (10.8.1) 488
FFT algorithm, (6.7.1) 267 
formula size, 

and circuit depth, relationship between, 

(9.2.2) 397 
over two different bases, (9.2.3) 399 
FSM, 

computational inequalities for 

interconnected, (3.1.3) 97 
equivalence between circuit computations 

and, (3.1.2) 96
function computed by, (3.1.1) 95 
languages are regular, (4.10.1) 185 
minimal, algorithm for, (4.7.2) 176 
shallow circuit simulation of, (3.2.1) 102 
shallow circuit simulation of, (3.2.2) 104 
gap, (8.5.3) 337 

GENERALIZED GEOGRAPHY language, 
log-space complete for PSPACE, (8.12.3) 
370 
HALF-CLIQUE CENTRAL SLICE language is 

NP-complete, (9.6.10) 435
halting problem, unsolvability, (5.8.1) 228
HMM, cost of problems in, (11.9.1) 565
Hong-Kung lower-bound method, (11.4.1)

537 
I/O bounds, matrix-vector product, (11.5.1) 

539 
I/O time bounds, 

for convolution in I/O-limited MHG, 
(11.5.8) 553



theorems (cont.) 

I/O time bounds (cont.) 

for convolution in MHG, (11.5.7) 553
for FFT in I/O-limited MHG, (11.5.6) 

551 
for FFT in MHG, (11.5.5) 549
for FFT in red-blue pebble game, (11.5.4) 

547 
for matrix multiplication in MHG, 

(11.5.3) 544
for matrix multiplication in red-blue 

pebble game, (11.5.2) 542 
Immerman-Szelepscenyi, (8.6.2) 344 
impossibility 96 
(3.1.1) 95

for bounded computations 24 
INDEPENDENT SET is NP-complete, 

(8.10.3) 357
integer, 

multiplication, space-time lower bound, 

(10.5.3) 475
multiplication, space-time lower bound,

(10.13.2) 507 
INTEGER PROGRAMMING, is 

NP-complete, (8.10.5) 362 
justification for P-complete problems, 

(8.14.2) 380
Krapchenko lower bound, (9.4.2) 408 
languages, 

2-SAT, in NL (8.11.1) 363 

accepted by NDTM accepted by DTM, 

(5.2.2) 215
accepted by NFSMs and DFSMs are 

same, (4.2.1) 156 
accepted by PDAs are context-free, 

(4.12.2) 194 
ALTERNATING QUANTIFIED 

SATISFIABILITY, log-space 

complete for PSPACE, (8.12.2) 369 
CIRCUIT SAT, (3.9.6) 132 
condition for P = NP, (3.9.4) 130 
decidable, (5.7.1) 224 
decidable, (5.7.2) 224 
decidable, (5.7.3) 224 
decidable, complement is decidable, 

(5.7.5) 225 
FSM, described by regular expressions, 

(4.4.2) 164 

GENERALIZED GEOGRAPHY, log-space 
complete for PSPACE, (8.12.3) 
370 
theorems (cont.)
languages (cont.) 

HALF-CLIQUE CENTRAL SLICE, is 

NP-complete, (9.6.10) 435 
MONOTONE CIRCUIT VALUE is 

P-complete, (8.9.1) 353 
non-recursively enumerable, (5.7.4) 224 
NSPACE, recognition by uniform circuit 

family, (8.13.3) 375
P-complete, CIRCUIT VALUE (3.9.5) 131 
phrase-structure languages are recursively 

enumerable, (5.4.2) 220 
phrase-structure, recursively enumerable 

languages are, (5.4.1) 219 
QUANTIFIED SATISFIABILITY, log-space 

complete for PSPACE, (8.12.1) 369 
recursively enumerable, are 

phrase-structure, (5.4.1) 219 
recursively enumerable, but not decidable, 

(5.7.6) 226 
recursively enumerable, but not decidable, 

(5.8.1) 228
recursively enumerable, but not decidable, 

(5.8.6) 230 
recursively enumerable, phrase-structure 

languages are, (5.4.2) 220 
regular, closure properties, (4.6.1) 170 
regular, conditions for, (4.7.1) 174(*) 
regular, decision problems on, (4.6.2) 171 
regular, FSM recognition of, (4.10.1) 185 
SATISFIABILITY, (3.9.7) 133 

undecidable, (5.8.5) 230 
undecidable, example, (5.8.2) 228 
undecidable, example, (5.8.3) 229 
undecidable, example, (5.8.4) 229 

LDL^T factorization of SPD matrices,
(6.5.2) 257 
algorithm for, (6.5.3) 259 

Leverrier, (6.5.5) 261 

linear, equation solutions, (6.6.1) 263 

LINEAR INEQUALITIES, is P-hard, (8.9.2) 
353 

log-space, 

contained in polynomial-time, (8.5.8) 342 
uniform circuits computable by polytime 
DTMs, (8.13.2) 374 

matrix, 

inversion algorithm, (6.5.4) 260 
inversion for triangular matrices, (6.5.1) 
256 



theorems (cont.) 
matrix (cont.) 

inversion, space-time lower bound,

(10.13.6) 512
multiplication as ring, (6.2.1) 242 
multiplication on a hypercube, (7.7.7) 

308 
multiplication on linear array, (7.5.1) 294 
multiplication, space-time lower bound,

(10.13.4) 511
multiplication, space-time lower bound,

(10.4.2) 472 
multiplication, space-time lower bound, 

(10.5.4) 479 
SPD, LDL^T factorization of, (6.5.2) 257
SPD, LDL^T factorization of, algorithm

for, (6.5.3) 259 
vector product, space-time lower bound, 
(10.13.3) 508
merging network space-time lower bound,

(10.5.6) 482
monotone circuit, 

lower bound for binary sorting, (9.6.1) 

413 
lower bound for merging, (9.6.2) 414 
size of clique function, (9.6.6) 430 
MONOTONE CIRCUIT VALUE language, is 

P-complete, (8.9.1) 353
monotone communication complexity 
equals depth of monotone 
functions, (9.7.2) 441 
multiplication matrix on 2D array, (7.5.3) 

296 
Myhill-Nerode, (4.7.1) 174(*) 
NAESAT is NP-complete, (8.10.2) 356
Neciporuk lower bounds, (9.4.1) 406 
nondeterministic time contained in 

deterministic space, (8.5.7) 341 
normal algorithm on CCC network, (7.7.6) 

308 
normal algorithms, normal, AT upper 

bound for, (12.5.1) 585 
NSPACE language recognition by uniform 

circuit family, (8.13.3) 375 
P and NP complete problems, justification 

for, (8.8.2) 352 
P-complete problems, justification for, 

(8.14.2) 380
P/poly circuit complexity class, (8.15.2) 383 
parallel prefix function, circuit for, 
(2.6.1) 57 
planar circuit, size lower bound in terms of

w(u,v)-flow, (12.7.1) 593
planar separator 589(*), 590 591 
(12.6.2) 591 
two-cost 600 
polynomial time algorithm, CFL 

recognition, (4.11.2) 189 
polytime DTM functions are computable by 
log-space uniform circuits, (8.13.1) 
374 
primality, 

is in intersection of NP and coNP, (8.6.4) 

348 
test, (8.6.3) 347 
processor-time tradeoff, (7.4.1) 290 
QUANTIFIED SATISFIABILITY language, 
log-space complete for PSPACE, (8.12.1) 
369 
RAM, 

computational inequalities for, (3.6.1) 118 
simulation by DTM, (8.4.1) 332 
simulation by TM, (3.8.1) 122 
universal, for FSMs, (3.4.1) 114
realization of pseudo-negations by monotone 

circuits, (9.6.8) 432 
reciprocal function, circuit for (2.10.1) 72 
recursively enumerable but non-decidable 

language, (5.8.6) 230 
reduction, 

from matrix multiplication to transitive 

closure, (6.4.2) 250 
to complete problems, (3.9.3) 129 
regular expressions, 

FSM languages described by (4.4.2) 164 
NFSM recognition of (4.4.1) 160 
properties of (4.3.1) 159
regular languages, 

closure properties, (4.6.1) 170 
conditions for, (4.7.1) 174(*) 
decision problems on, (4.6.2) 171 
FSM recognition of, (4.10.1) 185 
regular machine recognition problem, 

(5.8.4) 229 
relationship between planar circuit size and 

AT^2 and A^2T, (12.6.1) 589
relationships between I/O time bounds, 

(11.3.1) 535
Rice's Theorem, (5.8.5) 230 
SATISFIABILITY language, (3.9.7) 133 
Savitch's theorem, (8.5.5) 339 



theorems (cont.) 

semi-disjoint function circuit size lower 

bound, (9.6.4) 421
separator theorem for trees 397 (9.2.1) 397 
simple lower bounds on circuit size and 

depth, (9.3.1) 400
simulation, 

by precise TM, (8.4.2) 334 
of 2D array on 1D array, (7.5.4) 298
slice functions have comparable monotone
and non-monotone circuit sizes,
(9.6.9) 434 
sorting algorithm, 

hypercube, (7.7.1) 302 
sorting space-time lower bounds, (10.13.9) 

516 
space, 

hierarchy, (8.5.2) 337 
upper bounds to pebble graph, (10.7.1) 
483 
space-time, extreme tradeoffs, (10.3.1) 467 
space-time, lower bounds unique elements,

(10.13.8) 516 
SPD matrices, 

LDL^T factorization of, (6.5.2) 257
LDL^T factorization of, algorithm for,
(6.5.3) 259 
Strassen's matrix multiplication algorithm, 

(6.3.1) 247
SUBSET SUM is NP-complete, (8.10.5) 361 
TASK SEQUENCING is NP-complete, 

(8.10.5) 361
Taylor's, (2.10.2) 72
three-matrix product space-time lower

bound, (10.13.5) 512
time, 

and space-bounded complexity classes, 
relationship between, (8.5.6) 341 
bounded complexity classes, containment 

among, (8.5.4) 337 
hierarchy theorem, (8.5.1) 336 
TM, 

computational inequalities (3.9.2) 127 
computational inequalities, (3.9.8) 134 
single-tape simulation of multi-tape, 
(5.2.1) 213
transitivity of log-space transformations, 
(8.8.1) 350




theorems (cont.) 

unique elements space-time lower bounds,

(10.13.8) 516
unsolvable problems, 

empty set acceptance, (5.8.3) 229 
empty tape acceptance, (5.8.2) 228 
halting, (5.8.1) 228
Rice's Theorem, (5.8.5) 230 
self-terminating machine, (5.8.6) 230 
wrapped convolution, 

space-time lower bound, (10.13.1) 505 
space-time lower bound, (10.5.1) 474 
zero-one principle, (6.8.1) 271 
theory, 

computer science role, overview, 
(chapter) 3 (*) 
thesis, 

Church-Turing 209 
computation, 
parallel 379(*) 
serial 330 
Thiele, J.-J. 611

Thompson, C. D. 601 602 610 620
three-matrix product, 